variables given evidence data about the input variables. In addition, these
problems  usually  involve  data  that  have  thousands  of  examples.  Thus,  it  is 
important  to  develop  new  discriminative  learning  methods  for  MLNs  that  are 
more  accurate  and  more  scalable,  which  are  the  topics  addressed  in  this  thesis. 

First, we present a new method that discriminatively learns both the
structure and parameters for a special class of MLNs where all the clauses
are non-recursive ones. Non-recursive clauses arise in many learning problems
in Inductive Logic Programming. To further improve the predictive accuracy,
we propose a max-margin approach to learning weights for MLNs. Then,
to address the issue of scalability, we present CDA, an online max-margin
weight learning algorithm for MLNs. After that, we present OSL, the first
algorithm that performs both online structure learning and parameter learning.
Finally, we address an issue arising in applying MLNs to many real-world
problems:  learning  in  the  presence  of  many  hard  constraints.  Including  hard 
constraints  during  training  greatly  increases  the  computational  complexity  of 
the  learning  problem.  Thus,  we  propose  a  simple  heuristic  for  selecting  which 
hard  constraints  to  include  during  training. 

Experimental results on several real-world problems show that the proposed
methods are more accurate, more scalable (can handle problems with
thousands  of  examples),  or  both  more  accurate  and  more  scalable  than  existing 
learning methods for MLNs.

Chapter 1


Introduction 


A  lot  of  data  in  the  real  world  are  in  the  form  of  structured/relational 
data  such  as  sequences,  graphs,  multi-relational  data,  etc.  These  structured 
data  contain  a  lot  of  entities  (or  objects)  and  relationships  among  the  entities. 
For  example,  biochemical  data  contain  information  about  various  atoms  and 
their  interactions,  social  network  data  contain  information  about  people  and 
relationships between them, and so on. Moreover, there are many uncertainties
in these data: uncertainty about the attributes of an object, the type of an
object, as well as relationships between objects. Statistical relational learning
(SRL) (Getoor & Taskar, 2007), which combines ideas from rich knowledge
representations, such as first-order logic, with those from probabilistic graphical
models, is an emerging area of research that addresses the problem of learning
from these noisy structured/relational data.

A  variety  of  different  SRL  models  have  been  proposed  in  the  last 
two  decades.  Among  them,  Markov  Logic  Networks  (MLNs)  (Richardson  & 
Domingos,  2006;  Domingos  &  Lowd,  2009),  sets  of  weighted  first-order  logic 
formulae, are an elegant but powerful formalism that generalizes both
first-order logic and Markov networks. MLNs are capable of representing all
possible probability distributions over a finite number of objects (Richardson &
Domingos, 2006). Moreover, MLNs also subsume other SRL representations
such as Probabilistic Relational Models (Koller & Pfeffer, 1998) and Relational
Markov Networks (Taskar, Abbeel, & Koller, 2002). MLNs have been
successfully applied to a variety of real-world problems ranging from extracting
knowledge from text (Kok & Domingos, 2008) to visual event recognition
(Tran & Davis, 2008). Therefore, in this thesis, we have chosen MLNs as the
model for our research.

Currently,  most  of  the  existing  learning  algorithms  for  MLNs  are  in  the 
generative setting: they try to learn a model that is equally capable of predicting
the values of all variables given an arbitrary set of evidence. However,
most of the learning problems in relational data are discriminative: the
variables are divided into two disjoint sets, input and output, and the goal is to
correctly predict the values of the output variables given evidence data about
the input ones. For example, in many problems in biochemistry, the goal
is to learn a model that discriminates the active chemical compounds from
the inactive ones based on their molecular structures. This task is called
structure-activity relationship prediction, and it is an important task in drug
design and discovery (King, Sternberg, & Srinivasan, 1995). Another example
is structured prediction problems (Bakir, Hofmann, Schölkopf, Smola, Taskar,
& Vishwanathan, 2007) where the output variables are interdependent. For
example,  in  field  segmentation  (Grenager,  Klein,  &  Manning,  2005),  one  is 
given a text document represented as a sequence of tokens and the goal is to
segment the document into fields (i.e. to label each token in the document
with a field label). Note that there are dependencies between the tokens' labels:
for example, consecutive tokens usually have the same field label. It is, therefore,
an  important  research  problem  to  develop  discriminative  learning  methods  for 
MLNs  that  have  high  predictive  accuracies  on  these  discriminative  tasks. 

On  the  other  hand,  all  existing  methods  for  learning  the  structure  (i.e. 
logical  clauses)  of  an  MLN  (Kok  &  Domingos,  2005;  Mihalkova  &  Mooney, 
2007;  Biba,  Ferilli,  &  Esposito,  2008;  Kok  &  Domingos,  2009,  2010)  are  batch 
algorithms  that  are  effectively  designed  for  training  data  with  relatively  few 
mega-examples  (Mihalkova,  Huynh,  &  Mooney,  2007).  A  mega-example  is  a 
large set of connected facts, and mega-examples are disconnected and independent
from each other. For instance, in WebKB (Slattery & Craven, 1998),
there  are  four  mega-examples,  each  of  which  contains  data  about  a  particular 
university’s  computer-science  department’s  web  pages  of  professors,  students, 
courses,  research  projects  and  the  hyperlinks  between  them.  Previous  work 
has  found  that  there  are  a  lot  of  repeated  patterns  in  each  mega-example  and 
exploited  this  characteristic  to  develop  efficient  structure  learning  methods  for 
MLNs  on  those  problems  (Kok  &  Domingos,  2009,  2010).  However,  there  are 
many real-world problems with a different character, involving data with
thousands of smaller structured examples. For example, a standard dataset
for semantic role labeling consists of 90,750 training examples where each example
is a verb and all of its semantic arguments in a sentence (Carreras &
Màrquez, 2005). In addition, each example does not contain a lot of repeated
patterns, but the patterns are repeated across examples. Similarly,
most existing methods for learning the parameters (i.e. clauses' weights) of an
MLN employ batch training where the learner must repeatedly run inference
over all training examples in each iteration, which becomes computationally
expensive on datasets with thousands of training examples. Thus, it is necessary
to develop new discriminative learning methods for MLNs that can handle
data with a large number of examples.

1.1  Thesis  Contributions 

This  thesis  addresses  two  important  issues  in  discriminative  learning 
for MLNs: accuracy and scalability. We present new discriminative learning
methods  that  are  more  accurate,  more  scalable  (can  handle  problems  with 
thousands  of  examples),  or  both. 

First,  we  describe  a  new  method  that  discriminatively  learns  both  the 
structure and parameters for a special class of MLNs where all the clauses
are non-recursive ones, which arise in many benchmark problems in Inductive
Logic Programming (ILP). Most existing learning methods for non-recursive
clauses in ILP are purely logical approaches, which cannot handle uncertainty.
So, the idea is to use those ILP methods to construct a large number of potentially
useful clauses, and then use L1-regularized parameter learning methods
to properly weight them, preferring to assign zero weights to clauses that do
not contribute significantly to overall predictive accuracy, thereby eliminating
them. The proposed approach outperforms existing ones in terms of predictive
accuracy and achieves state-of-the-art results on a benchmark problem in drug
design.

Second,  existing  discriminative  methods  for  learning  the  weights  of  an 
MLN  attempt  to  maximize  the  conditional  log  likelihood,  which  is  suitable 
when the goal is to predict accurate probabilities. However, in many applications,
the actual goal is to maximize an alternative performance metric such
as classification accuracy or F-measure. Max-margin methods are a competing
approach to discriminative training that are well-founded in computational
learning theory and have demonstrated empirical success in many applications
(Cristianini & Shawe-Taylor, 2000). They also have the advantage that they
can be adapted to maximize a variety of performance metrics in addition to
classification accuracy (Joachims, 2005). Thus, we present a max-margin approach
to learning weights for an MLN. We show how to formulate the weight
learning problem for MLNs as a max-margin optimization problem. In order
to solve the optimization problem, we develop a new approximate inference
algorithm for MLNs based on Linear Programming relaxation. Experimental
results on several problems show that our max-margin weight learner generally
has better and more stable predictive accuracy than the previously best
discriminative  MLN  weight  learner. 

However, like other existing weight learners for MLNs, the above max-margin
weight learner does not scale to problems with thousands of training examples.
To address this issue, we develop CDA, an online max-margin weight
learning algorithm for structured prediction, and apply it to learn weights for
an MLN. The algorithm is derived from the primal-dual framework (Kakade &
Shalev-Shwartz, 2009), a general framework for deriving online algorithms that
have low regret. Since CDA processes one example at a time, it can handle
problems with thousands of training examples where existing batch learning
methods for MLNs cannot. In addition, CDA generally achieves better
accuracy than existing online methods for structured prediction.

The  above  CDA  online  algorithm  only  updates  the  parameters  of  an 
input  MLN  and  assumes  the  structure  of  the  input  MLN  is  complete  or  perfect. 
Nevertheless,  it  is  usually  impossible  or  infeasible  to  specify  a  complete  or 
perfect  model’s  structure  at  the  beginning.  So  it  would  be  useful  to  have 
an  algorithm  that  enhances  the  model’s  initial  structure  along  with  updating 
the  model’s  parameters.  Therefore,  we  present  OSL,  the  first  algorithm  that 
performs  both  online  structure  and  parameter  learning.  At  each  step,  based 
on  the  model’s  wrong  predictions,  OSL  finds  new  clauses  that  fix  these  errors, 
then uses an adaptive subgradient method with L1-regularization to update
weights  for  both  old  and  new  clauses.  Experimental  results  on  two  real-world 
datasets  show  that  OSL  outperforms  systems  that  only  do  online  parameter 
learning.  In  addition,  OSL  also  performs  well  when  starting  from  scratch  (i.e. 
no  input  structure). 

Finally, we address an issue arising in applying MLNs to many real-world
problems: learning in the presence of many hard constraints. Including
hard constraints during training greatly increases the computational complexity
of the learning problem. Thus, we propose a simple heuristic for selecting
which hard constraints to include during training. Experimental results on the
task  of  bibliographic  citation  segmentation  show  that  the  proposed  approach 
achieves  the  best  predictive  accuracy  while  still  allowing  for  efficient  training. 


1.2  Thesis  Outline 

The  remainder  of  the  thesis  is  organized  as  follows. 

• Chapter 2 reviews our terminology and notation, and presents background
on MLNs, max-margin structured prediction, the primal-dual
framework,  and  some  standard  evaluation  metrics. 

•  Chapter  3  describes  our  discriminative  structure  and  parameter  learning 
algorithm  for  MLNs  with  non-recursive  clauses. 

• Chapter 4 presents the max-margin approach to learning weights for
MLNs.

• Chapter 5 discusses our online max-margin weight learning algorithm
for MLNs.

•  Chapter  6  describes  OSL,  an  online  algorithm  that  updates  both  the 
structure  and  parameters  of  an  MLN. 

•  Chapter  7  presents  our  work  on  learning  with  hard  constraints. 

• Chapter 8 discusses future work and Chapter 9 concludes the thesis.

We  note  that  the  material  presented  in  Chapter  3  has  appeared  in  our  previous 
publication (Huynh & Mooney, 2008), the material in Chapter 4 has appeared
in (Huynh & Mooney, 2009) and the material in Chapter 5 has appeared in
(Huynh  &  Mooney,  2011). 


Chapter  2 


Background 

2.1  Terminology  and  Notation 

There  are  four  types  of  symbols  in  first-order  logic:  constants,  variables, 
predicates,  and  functions  (Genesereth  &  Nilsson,  1987).  Here,  we  assume  that 
the  domains  contain  no  functions.  Constants  are  objects  in  the  domain  and  can 
have  types.  Variables  range  over  objects  in  the  domain.  Predicates  represent 
relations  in  the  domain.  Each  predicate  has  a  number  of  arguments.  Each 
argument  can  have  a  type  that  specifies  the  type  of  constant  that  can  be 
used  to  ground  it.  We  denote  constants  by  strings  starting  with  upper-case 
letters,  and  variables  by  strings  starting  with  lower-case  letters.  A  term  is  a 
constant  or  a  variable.  An  atom  is  a  predicate  applied  to  terms.  A  ground 
atom  is  an  atom  all  of  whose  arguments  are  constants.  A  positive  literal  is  an 
atom,  and  a  negative  literal  is  a  negated  atom.  A  ground  literal  is  a  literal 
containing  only  constants.  A  possible  world  is  an  assignment  of  truth  values 
to  all  ground  atoms  in  a  domain.  A  formula  consists  of  literals  connected  by 
logical connectives (i.e. $\lor$ and $\land$). A formula in clausal form, also called a
clause,  is  a  disjunction  of  literals.  A  ground  clause  is  a  clause  containing  only 
ground  literals.  A  clause  with  at  most  one  positive  literal  is  called  a  Horn 
clause. A Horn clause with exactly one positive literal is a definite clause.

For mathematical terms, we use lower case letters (e.g. $\eta$, $\lambda$) to denote
scalars, bold face letters (e.g. $\mathbf{x}$, $\mathbf{y}$, $\mathbf{w}$) to denote vectors, and upper case
letters (e.g. $W$, $X$) to denote sets. The inner product between vectors $\mathbf{w}$ and
$\mathbf{x}$ is denoted by $\langle \mathbf{w}, \mathbf{x} \rangle$. The $[a]_+$ notation denotes a function truncated at 0,
i.e. $[a]_+ = \max(a, 0)$.

2.2  Inductive  Logic  Programming  and  Aleph 

Traditional Inductive Logic Programming (ILP) systems discriminatively
learn logical Horn-clause rules (logic programs) for inferring a given
target predicate given information provided by a set of background predicates.
These purely logical definitions are induced from Horn-clause background
knowledge and a set of positive and negative tuples of the target predicate.
For more information about ILP, please see (Dzeroski, 2007).

Aleph is a popular and effective ILP system primarily based on Progol
(Muggleton, 1995). The basic Aleph algorithm consists of four steps.
First, it selects a positive example to serve as the "seed" example. Then, it constructs
the most specific clause, the "bottom clause", that entails that selected
example. The bottom clause is formed by conjoining all known facts about
the seed example. Next, Aleph finds generalizations of this bottom clause by
performing a general-to-specific search. These generalized clauses are scored
using a chosen evaluation metric, and the clause with the best score is added
to the final theory. This process is repeated until it finds a set of clauses that
covers all the positive examples. Aleph allows users to customize each of
these steps, and thereby supports a variety of specific algorithms.
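
The four steps above can be illustrated with a small covering loop. The following is a simplified propositional sketch and not Aleph's actual implementation: examples are plain sets of hypothetical attribute facts, the bottom clause is simply the seed's fact set, candidate clause bodies are subsets of it with at most `max_body` literals, and the score (positives covered minus negatives covered) is an assumed evaluation metric. All names are chosen for illustration only.

```python
from itertools import combinations

def covers(body, example):
    """A clause body covers an example if every body literal holds in it."""
    return body <= example

def learn_theory(positives, negatives, max_body=2):
    """Greedy covering loop in the spirit of the four steps described above."""
    theory = []
    uncovered = list(positives)
    while uncovered:
        seed = uncovered[0]              # step 1: select a seed example
        bottom = sorted(seed)            # step 2: most specific clause (all seed facts)
        best_body, best_score = None, None
        # step 3: general-to-specific search, from the empty body to longer bodies
        for size in range(max_body + 1):
            for literals in combinations(bottom, size):
                body = frozenset(literals)
                pos = sum(covers(body, e) for e in uncovered)
                neg = sum(covers(body, e) for e in negatives)
                score = pos - neg        # assumed evaluation metric
                if best_score is None or score > best_score:
                    best_body, best_score = body, score
        theory.append(best_body)         # step 4: add the best clause to the theory
        # every candidate body is a subset of the seed, so the seed is always removed
        uncovered = [e for e in uncovered if not covers(best_body, e)]
    return theory

# Hypothetical toy data: molecules described by the substructures they contain.
positives = [{"ring", "nitro"}, {"ring", "halogen"}]
negatives = [{"nitro"}, {"halogen"}]
theory = learn_theory(positives, negatives)
```

On this toy data the loop stops after one iteration, because the single body `{ring}` covers both positive examples and no negative one.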


2.3  MLNs  and  Alchemy 


An  MLN  consists  of  a  set  of  weighted  first-order  logic  formulae.  It 
provides  a  way  of  softening  first-order  logic  by  making  situations  in  which 
not  all  formulae  are  satisfied  less  likely  but  not  impossible  (Richardson  & 
Domingos,  2006;  Domingos  &  Lowd,  2009).  More  formally,  let  X  be  the  set 
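of all ground atoms, $C$ be the set of all clauses in the MLN, $w_i$ be the weight
associated with clause $c_i \in C$, and $S_{c_i}$ be the set of all possible groundings of
clause $c_i$. Then the probability of a possible world $\mathbf{x}$ is defined as (Richardson
& Domingos, 2006):

$$
P(X = \mathbf{x}) = \frac{1}{Z} \exp\left( \sum_{c_i \in C} w_i \sum_{g \in S_{c_i}} g(\mathbf{x}) \right)
= \frac{1}{Z} \exp\left( \sum_{c_i \in C} w_i n_i(\mathbf{x}) \right)
$$

where $g(\mathbf{x})$ is 1 if $g$ is satisfied and 0 otherwise, $n_i(\mathbf{x}) = \sum_{g \in S_{c_i}} g(\mathbf{x})$ is the number
of true groundings of $c_i$ in the possible world $\mathbf{x}$, and $Z = \sum_{\mathbf{x} \in X} \exp\left( \sum_{c_i \in C} w_i n_i(\mathbf{x}) \right)$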
is  the  normalization  constant.  In  many  applications,  we  know  a  priori  which 
predicates  are  evidence  predicates  and  which  predicates  are  query  ones,  and 
the goal is to correctly predict query atoms given evidence atoms. If we partition
the ground atoms in the domain into a set of evidence atoms $X$ and a
set of query atoms $Y$, the conditional probability of $\mathbf{y}$ given $\mathbf{x}$ is:

$$
P(Y = \mathbf{y} \mid X = \mathbf{x}) = \frac{1}{Z_{\mathbf{x}}} \exp\left( \sum_{c_i \in C} w_i n_i(\mathbf{x}, \mathbf{y}) \right) \qquad (2.1)
$$

where $n_i(\mathbf{x}, \mathbf{y})$ is the number of true groundings of $c_i$ in the possible world
$(\mathbf{x}, \mathbf{y})$ and $Z_{\mathbf{x}} = \sum_{\mathbf{y} \in Y} \exp\left( \sum_{c_i \in C} w_i n_i(\mathbf{x}, \mathbf{y}) \right)$ is the normalization constant.
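
To make the two definitions above concrete, the following minimal sketch computes both probabilities by brute-force enumeration for a tiny, hypothetical MLN. The predicates `Smokes` and `Cancer`, the constants `A` and `B`, the single clause Smokes(x) => Cancer(x), and its weight 1.5 are all assumptions made for illustration; enumeration is only feasible for domains this small.

```python
import math
from itertools import product

CONSTANTS = ["A", "B"]
ATOMS = [(pred, c) for pred in ("Smokes", "Cancer") for c in CONSTANTS]

def n_smokes_implies_cancer(world):
    """n_i(x): number of true groundings of Smokes(x) => Cancer(x)."""
    return sum((not world[("Smokes", c)]) or world[("Cancer", c)] for c in CONSTANTS)

# Each clause is a pair (weight w_i, counting function n_i).
CLAUSES = [(1.5, n_smokes_implies_cancer)]

def score(world):
    """Sum over clauses of w_i * n_i(world)."""
    return sum(w * n(world) for w, n in CLAUSES)

def all_worlds():
    """Enumerate every truth assignment to the ground atoms."""
    for values in product([False, True], repeat=len(ATOMS)):
        yield dict(zip(ATOMS, values))

def probability(world):
    """P(X = x) = exp(score(x)) / Z, with Z summed over all possible worlds."""
    z = sum(math.exp(score(x)) for x in all_worlds())
    return math.exp(score(world)) / z

def conditional_probability(query, evidence):
    """P(Y = y | X = x) as in Equation 2.1; Z_x sums over query assignments only."""
    query_atoms = list(query)

    def unnormalized(assignment):
        return math.exp(score({**evidence, **assignment}))

    z_x = sum(unnormalized(dict(zip(query_atoms, values)))
              for values in product([False, True], repeat=len(query_atoms)))
    return unnormalized(query) / z_x
```

A world that violates one grounding of the clause is less likely than one that satisfies both groundings by a factor of exp(1.5), but it is not impossible, which is the softening effect described above.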

There  are  two  main  inference  tasks  in  MLNs.  The  first  one  is  to  infer 
the  Most  Probable  Explanation  (MPE)  or  the  most  probable  truth  values  for  a 
set  of  unknown  literals  y  given  a  set  of  known  literals  x,  provided  as  evidence 
(also  called  MAP  inference  in  some  other  work).  This  task  is  formally  defined 
as  follows: 


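$$
\arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})
= \arg\max_{\mathbf{y}} \frac{1}{Z_{\mathbf{x}}} \exp\left( \sum_{c_i \in C} w_i n_i(\mathbf{x}, \mathbf{y}) \right)
= \arg\max_{\mathbf{y}} \sum_{c_i \in C} w_i n_i(\mathbf{x}, \mathbf{y})
$$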


MPE inference in MLNs is therefore equivalent to finding the truth assignment
that maximizes the sum of the weights of satisfied clauses, a Weighted
MAX-SAT problem. This is an NP-hard problem for which a number of approximate
solvers exist, of which the most commonly used is MaxWalkSAT
(Kautz, Selman, & Jiang, 1997). Recently, Riedel (2008) proposed a more
efficient method to solve the MPE inference problem called Cutting Plane Inference
(CPI), which does not require grounding the whole MLN. CPI is
a meta inference algorithm that incrementally constructs some parts of a large
and complex Markov network and then uses some MPE inference algorithm
to find the MPE solution on the constructed network. The main idea is that
we don't need to ground the whole Markov network to find the MPE solution
since there is a lot of redundant information in the whole network. However,
the CPI method only works well when the separation step returns a small set
of constraints. In the worst case, it also constructs the whole ground MLN.
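
The Weighted MAX-SAT view of MPE inference can be sketched as follows. This is an exhaustive search over query assignments, used here only as a stand-in for an approximate solver such as MaxWalkSAT; it is exponential in the number of query atoms. The clause representation (a list of `(atom, is_positive)` literals) and the toy atoms `S`, `C` and `F` are assumptions made for illustration.

```python
from itertools import product

def satisfied(clause, assignment):
    """A ground clause is a disjunction of (atom, is_positive) literals."""
    return any(assignment[atom] == is_positive for atom, is_positive in clause)

def mpe_brute_force(weighted_clauses, query_atoms, evidence):
    """Return the query assignment maximizing the total weight of satisfied clauses."""
    best_assignment, best_weight = None, float("-inf")
    for values in product([False, True], repeat=len(query_atoms)):
        candidate = dict(zip(query_atoms, values))
        world = {**evidence, **candidate}
        weight = sum(w for w, clause in weighted_clauses if satisfied(clause, world))
        if weight > best_weight:
            best_assignment, best_weight = candidate, weight
    return best_assignment, best_weight

# Hypothetical ground MLN: S is evidence, C and F are query atoms.
weighted_clauses = [
    (1.5, [("S", False), ("C", True)]),   # !S v C
    (0.8, [("C", False), ("F", True)]),   # !C v F
    (1.0, [("F", False)]),                # !F
]
assignment, weight = mpe_brute_force(weighted_clauses, ["C", "F"], {"S": True})
```

In this example the best assignment sets `C` to true and `F` to false: it satisfies the first and third clauses, and giving up the 0.8 clause costs less than violating either of the other two.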

The  second  inference  task  in  MLNs  is  marginal  inference  whose  goal 
is  to  compute  the  marginal  probabilities  of  some  unknown  query  literals  y. 
Computing these probabilities is also intractable, but there are good approximation
algorithms such as MC-SAT (Poon & Domingos, 2006) and lifted belief
propagation  (Singla  &  Domingos,  2008). 

Learning  an  MLN  consists  of  two  tasks:  structure  learning  and  weight 
learning. The weight learner can learn weights for clauses written by a human
expert or automatically induced by a structure learner. There are two
approaches to weight learning in MLNs: generative and discriminative. In discriminative
learning, we know a priori which predicates will be used to supply
evidence and which ones will be queried, and the goal is to correctly predict the
latter given the former. Several discriminative weight learning methods have
been proposed, all of which try to find weights that maximize the Conditional
Log Likelihood (CLL) (equivalently, minimize the negative CLL). In MLNs,
the derivative of the negative CLL with respect to a weight $w_i$ is the difference
between the expected number of true groundings $E_{\mathbf{w}}[n_i]$ of the corresponding clause
$c_i$ and the actual number according to the data, $n_i$. However, computing the
expected count $E_{\mathbf{w}}[n_i]$ is intractable. The first discriminative weight learner
(Singla & Domingos, 2005) uses the structured perceptron algorithm (Collins,
2002), approximating the intractable expected counts by the counts in
the MPE state computed by MaxWalkSAT. Later, Lowd and Domingos
(2007) presented a number of first-order and second-order methods for optimizing
the CLL. These methods use samples from MC-SAT to approximate
the expected counts used to compute the gradient and Hessian of the CLL.
Among them, the best performing is preconditioned scaled conjugate gradient
(PSCG) (Lowd & Domingos, 2007). This method uses the inverse diagonal
Hessian as the preconditioner.
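
The gradient of the negative CLL described above can be written out directly for a toy problem. The sketch below computes the expected counts exactly by enumerating all query assignments, which is only possible for very small domains; the learners cited above replace this step by MPE counts or MC-SAT samples. The function name, the argument layout and the single-clause example are assumptions made for illustration.

```python
import math
from itertools import product

def neg_cll_gradient(weights, count_fns, evidence, true_query, query_atoms):
    """Gradient of the negative CLL: E_w[n_i] - n_i(x, y_observed) for each clause i."""
    assignments = [dict(zip(query_atoms, values))
                   for values in product([False, True], repeat=len(query_atoms))]
    # counts[k][i] = n_i(x, y_k) for the k-th query assignment
    counts = [[n(evidence, y) for n in count_fns] for y in assignments]
    scores = [sum(w * c for w, c in zip(weights, row)) for row in counts]
    z_x = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / z_x for s in scores]
    expected = [sum(p * row[i] for p, row in zip(probs, counts))
                for i in range(len(weights))]
    observed = [n(evidence, true_query) for n in count_fns]
    return [e - o for e, o in zip(expected, observed)]

def n_smokes_implies_cancer(evidence, query):
    """True-grounding count of the hypothetical clause Smokes(A) => Cancer(A)."""
    return int((not evidence["Smokes(A)"]) or query["Cancer(A)"])

gradient = neg_cll_gradient(
    weights=[0.0],
    count_fns=[n_smokes_implies_cancer],
    evidence={"Smokes(A)": True},
    true_query={"Cancer(A)": True},
    query_atoms=["Cancer(A)"],
)
```

With a zero weight the two query assignments are equally likely, so the expected count is 0.5 while the observed count is 1. The gradient of the negative CLL is therefore negative, and a descent step increases the weight of the clause.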

Regarding  structure  learning,  there  are  currently  two  main  approaches 
for  learning  clauses  for  MLNs.  The  first  one  is  a  top-down  approach  (Kok  & 
Domingos, 2005; Biba et al., 2008). These algorithms can start from an empty
network  or  from  an  existing  knowledge  base.  So  they  can  be  used  for  learning  a 
new  MLN  or  revising  an  existing  MLN.  The  algorithms  usually  start  from  the 
set  of  unit  clauses,  and  iteratively  add  new  clauses  to  the  model.  In  each  step, 
they try to find the best clause to add to the current MLN by adding, deleting,
or flipping the sign of a literal (Kok & Domingos, 2005) or performing a
stochastic local search (Biba et al., 2008). The weight of each candidate clause
is set to optimize the weighted pseudo log-likelihood (WPLL) (Kok & Domingos,
2005) through an optimization procedure. Then each candidate structure
is scored by the WPLL (Kok & Domingos, 2005) or by the CLL (Biba et al.,
2008), and the best candidate clause is added to the learnt MLN. The other
approach is the bottom-up one (Mihalkova & Mooney, 2007; Kok & Domingos,
2009, 2010). Mihalkova and Mooney (2007) proposed the first bottom-up
structure learner for MLNs called BUSL. It first constructs Markov network
templates from the data and then generates candidate clauses from these network
templates. All candidate clauses are also evaluated using WPLL, and
added  to  the  final  MLN  in  a  greedy  manner.  Later,  Kok  and  Domingos  (2009) 
proposed  a  new  bottom-up  structure  learner  for  MLNs  called  LHL.  The  main 
idea  of  this  algorithm  is  based  on  the  observation  that  a  relational  database 
can be viewed as a hypergraph with constants as nodes and relations as hyperedges.
Then a clause can be constructed from a path in the hypergraph.
However, a hypergraph usually contains an exponential number of paths. So
to make it tractable, the algorithm first lifts the hypergraph by jointly clustering
all the constants in the relational database to form higher-level concepts,
then  finds  paths  in  the  lifted  hypergraph.  Recently,  Kok  and  Domingos  (2010) 
proposed  LSM,  another  bottom-up  MLN  structure  learner  that  can  learn  long 
clauses  (more  than  5  literals).  The  key  insight  of  LSM  is  that  relational  data 
usually contain repeated patterns of densely connected objects called structural
motifs. By limiting the search to each unique motif, LSM is able to find
good  clauses  in  an  efficient  manner. 

Alchemy  (Kok,  Singla,  Richardson,  &  Domingos,  2005)  is  an  open 
source  software  package  for  MLNs.  It  includes  implementations  for  all  of  the 
major  existing  algorithms  for  structure  learning,  generative  weight  learning, 
discriminative  weight  learning,  and  inference.  Our  proposed  algorithms  are 
implemented using Alchemy.

2.4 Max-Margin Structured Prediction

In  this  section,  we  briefly  review  the  max-margin  structured  prediction 
problem  and  an  algorithmic  schema  for  solving  it  efficiently.  For  more  detail, 
see Tsochantaridis, Joachims, Hofmann, and Altun (2005) and Joachims, Finley,
and Yu (2009). In structured prediction, we want to learn a function
$h : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the space of inputs and $\mathcal{Y}$ is the space of multivariate and
structured outputs, from a set of training examples $S$:

$$
S = ((\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_n, \mathbf{y}_n)) \in (\mathcal{X} \times \mathcal{Y})^n
$$

The  goal  is  to  find  a  function  h  that  has  low  prediction  error.  This  can  be 
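accomplished by learning a discriminant function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, then maximizing
$f$ over all $\mathbf{y} \in \mathcal{Y}$ for a given input $\mathbf{x}$ to get the prediction:

$$
h_{\mathbf{w}}(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} f_{\mathbf{w}}(\mathbf{x}, \mathbf{y})
$$

The discriminant function $f_{\mathbf{w}}(\mathbf{x}, \mathbf{y})$ takes the form of a linear function:

$$
f_{\mathbf{w}}(\mathbf{x}, \mathbf{y}) = \mathbf{w}^T \phi(\mathbf{x}, \mathbf{y})
$$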

where $\mathbf{w} \in \mathbb{R}^n$ is a parameter vector and $\phi$ is a feature vector relating an input
$\mathbf{x}$ and output $\mathbf{y}$. The features need to be designed for a given problem so that
they capture the dependency structure of $\mathbf{y}$ and $\mathbf{x}$ and the relations among
the outputs $\mathbf{y}$. Then, the goal is to find a weight vector $\mathbf{w}$ that maximizes
the margin, which is the difference between the model's score for the correct
label and the model's score for the closest incorrect one:

$$
\gamma(\mathbf{x}_i, \mathbf{y}_i; \mathbf{w}) = \mathbf{w}^T \phi(\mathbf{x}_i, \mathbf{y}_i) - \max_{\mathbf{y}' \in \mathcal{Y} \setminus \mathbf{y}_i} \mathbf{w}^T \phi(\mathbf{x}_i, \mathbf{y}')
$$
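
As a small worked instance of the margin, the sketch below enumerates all outputs of a short binary labeling problem. The feature map `phi` (the number of positions where the label agrees with the observation, and the number of adjacent positions with equal labels), the weight vector and the input are hypothetical choices made for illustration.

```python
from itertools import product

def dot(w, v):
    """Inner product of two equal-length sequences."""
    return sum(wi * vi for wi, vi in zip(w, v))

def phi(x, y):
    """Hypothetical joint feature vector for a binary sequence labeling task."""
    agreements = sum(xi == yi for xi, yi in zip(x, y))
    equal_neighbors = sum(y[j] == y[j + 1] for j in range(len(y) - 1))
    return [agreements, equal_neighbors]

def margin(w, x, y_true, outputs):
    """Score of the correct output minus the best score among all other outputs."""
    best_wrong = max(dot(w, phi(x, y)) for y in outputs if y != y_true)
    return dot(w, phi(x, y_true)) - best_wrong

x = (1, 1, 0)
y_true = (1, 1, 0)
outputs = list(product([0, 1], repeat=len(x)))
w = [2.0, 1.0]
gamma = margin(w, x, y_true, outputs)
```

Here the correct labeling scores 7 and its closest competitor, `(1, 1, 1)`, scores 6, so the margin is 1. Enumerating the output space is only possible for toy problems; in general the inner maximization is an inference problem.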

The max-margin problem above can be formulated as an optimization
problem called structural SVM (Tsochantaridis, Joachims, Hofmann, & Altun,
2004; Tsochantaridis et al., 2005) as follows:


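Optimization Problem 1 (OP1): Structural SVM

$$
\min_{\mathbf{w},\, \boldsymbol{\xi} \ge 0} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + \frac{C}{n} \sum_{i=1}^{n} \xi_i
$$

$$
\text{s.t. } \forall i, \forall \mathbf{y} \in \mathcal{Y} \setminus \mathbf{y}_i : \; \mathbf{w}^T [\phi(\mathbf{x}_i, \mathbf{y}_i) - \phi(\mathbf{x}_i, \mathbf{y})] \ge 1 - \xi_i
$$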


The slack variables $\xi_i$ are used to allow some errors in the training data,
and the scalar $C > 0$ is a hyper-parameter that controls the trade-off between
minimizing the training error and maximizing the margin. This formulation
implicitly imposes a zero-one loss on each constraint, which is inappropriate for
most kinds of structured output since it treats a prediction that is very close
to the correct one the same as a prediction that is completely different from
the right one. To address this problem, Taskar, Guestrin, and Koller
(2004) proposed to re-scale the margin by the Hamming loss of the wrong
label. This margin-rescaling approach also works for other loss functions
(Tsochantaridis et al., 2005). The resulting optimization problem is as
follows:


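Optimization Problem 2 (OP2): Structural SVM with Margin-Rescaling

$$
\min_{\mathbf{w},\, \boldsymbol{\xi} \ge 0} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + \frac{C}{n} \sum_{i=1}^{n} \xi_i
$$

$$
\text{s.t. } \forall i, \forall \mathbf{y} \in \mathcal{Y} : \; \mathbf{w}^T [\phi(\mathbf{x}_i, \mathbf{y}_i) - \phi(\mathbf{x}_i, \mathbf{y})] \ge \Delta(\mathbf{y}_i, \mathbf{y}) - \xi_i
$$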

Note that OP1 is OP2 with the zero-one label loss. Recently,
Joachims et al. (2009) proposed a reformulation of the above optimization problem,
called "1-slack" structural SVM, which combines all training examples into
one big training example and has only one slack variable for the new mega-example:


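Optimization Problem 3 (OP3): 1-Slack Structural SVM with Margin-Rescaling

$$
\min_{\mathbf{w},\, \xi \ge 0} \; \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \xi
$$

$$
\text{s.t. } \forall (\hat{\mathbf{y}}_1, \dots, \hat{\mathbf{y}}_n) \in \mathcal{Y}^n : \; \frac{1}{n} \mathbf{w}^T \sum_{i=1}^{n} [\phi(\mathbf{x}_i, \mathbf{y}_i) - \phi(\mathbf{x}_i, \hat{\mathbf{y}}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(\mathbf{y}_i, \hat{\mathbf{y}}_i) - \xi
$$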

The  1-slack  reformulation  leads  to  a  faster  and  more  scalable  training  algorithm 
whose  running  time  is  provably  linear  in  the  number  of  training  examples 
(Joachims et al., 2009).

In each iteration, Algorithm 2.1 solves a Quadratic Programming
(QP) problem (line 4) to find the optimal weights corresponding to the current
set of constraints $W$ and calls a separation oracle (line 6), also called a
loss-augmented inference problem (Taskar, Chatalbashev, Koller, & Guestrin, 2005),
to find the most violated constraint to add to $W$. The QP problem in line 4 can
be solved by any general QP solver. In contrast, for each representation (such
as Markov networks or weighted context free grammars) a specific algorithm
is needed for solving the loss-augmented inference problem.

To enforce a sparse solution on the learned weights, we can replace the
squared 2-norm, $w^T w$, in these formulations by the 1-norm, $\|w\|_1 = \sum_i |w_i|$,
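
Algorithm 2.1 Cutting-plane method for solving the “1-slack structural
SVMs” (Joachims et al., 2009)

1: Input: $S = ((x_1, y_1), \ldots, (x_n, y_n))$, $C$, $\epsilon$
2: $W \leftarrow \emptyset$
3: repeat
4:   $(w, \xi) \leftarrow \arg\min_{w,\ \xi \ge 0} \frac{1}{2} w^T w + C \xi$
       s.t. $\forall (\bar{y}_1, \ldots, \bar{y}_n) \in W : \frac{1}{n} w^T \sum_{i=1}^{n} [\phi(x_i, y_i) - \phi(x_i, \bar{y}_i)] \ge \frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \bar{y}_i) - \xi$
5:   for i = 1 to n do
6:     $\hat{y}_i \leftarrow \arg\max_{\hat{y} \in \mathcal{Y}} \{\Delta(y_i, \hat{y}) + w^T \phi(x_i, \hat{y})\}$
7:   end for
8:   $W \leftarrow W \cup \{(\hat{y}_1, \ldots, \hat{y}_n)\}$
9: until $\frac{1}{n} \sum_{i=1}^{n} \Delta(y_i, \hat{y}_i) - \frac{1}{n} w^T \sum_{i=1}^{n} [\phi(x_i, y_i) - \phi(x_i, \hat{y}_i)] \le \xi + \epsilon$
10: return $(w, \xi)$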


like previous work on 1-norm SVMs (Bradley & Mangasarian, 1998; Zhu, Rosset,
Hastie, & Tibshirani, 2003) for binary classification. Using the substitution
$w_i = w_i^+ - w_i^-$ and $|w_i| = w_i^+ + w_i^-$ with $w_i^+, w_i^- \ge 0$ (Fung & Mangasarian,
2004), we can cast the 1-norm minimization problem as a Linear Programming
(LP) problem and use Algorithm 2.1 to solve it by replacing
the QP problem in line 4 with the transformed LP problem. A special case of
the 1-norm structural SVM for Markov networks is presented in
Zhu and Xing (2009).

In summary, to apply structural SVMs to a new problem, one needs
to choose a representation for the model, design a corresponding feature vector
function $\phi(x, y)$, select a label loss function $\Delta(y, \bar{y})$, and design algorithms to
solve the two argmax problems:

Prediction: $\arg\max_{y \in \mathcal{Y}} w^T \phi(x, y)$

Separation Oracle: $\arg\max_{\bar{y} \in \mathcal{Y}} \{\Delta(y, \bar{y}) + w^T \phi(x, \bar{y})\}$
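
As a concrete illustration of these two argmax problems, the following Python sketch solves both by brute-force enumeration over a small label set. The multi-class setting, the block feature map `phi`, and the zero-one loss are illustrative assumptions and not the representation used in this thesis; for structured outputs the enumeration would have to be replaced by a representation-specific inference algorithm.

```python
def phi(x, y, num_labels):
    """Hypothetical joint feature map: copy x into the block owned by label y."""
    features = [0.0] * (num_labels * len(x))
    start = y * len(x)
    features[start:start + len(x)] = x
    return features


def score(w, x, y, num_labels):
    """Linear score w^T phi(x, y)."""
    return sum(wi * fi for wi, fi in zip(w, phi(x, y, num_labels)))


def zero_one_loss(y_true, y_bar):
    """Label loss Delta(y, y_bar): 0 if the labels agree, 1 otherwise."""
    return 0.0 if y_true == y_bar else 1.0


def predict(w, x, num_labels):
    """Prediction: argmax_y w^T phi(x, y), by enumeration."""
    return max(range(num_labels), key=lambda y: score(w, x, y, num_labels))


def separation_oracle(w, x, y_true, num_labels, loss=zero_one_loss):
    """Loss-augmented inference: argmax_y {Delta(y_true, y) + w^T phi(x, y)}."""
    return max(range(num_labels),
               key=lambda y: loss(y_true, y) + score(w, x, y, num_labels))


w = [1.0, 0.5, 0.0]   # one weight per label block (x has a single feature)
x = [1.0]
print(predict(w, x, 3))               # label with the highest score
print(separation_oracle(w, x, 0, 3))  # most violated label when the truth is 0
```

In this toy case the two calls disagree: the prediction is the highest-scoring label, while the oracle returns a wrong label whose score plus loss exceeds the score of the true label.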

2.5 The Primal-Dual Algorithmic Framework for Online Learning

In this section, we briefly review the primal-dual framework for strongly
convex loss functions (Kakade & Shalev-Shwartz, 2009), which is the latest
framework for deriving online algorithms that have low regret, where regret is
the difference between the cumulative loss of the online algorithm and the
cumulative loss of the optimal offline solution. Consider the following primal
optimization problem:

$$\inf_{w \in W} P_{t+1}(w) = \inf_{w \in W} \left( (\sigma t) f(w) + \sum_{i=1}^{t} g_i(w) \right) \qquad (2.1)$$

where $f : W \to \mathbb{R}_+$ is a function that measures the complexity of the weight
vectors in $W$, $g_i : W \to \mathbb{R}$ is a loss function, and $\sigma$ is a non-negative scalar. For
example, if $W = \mathbb{R}^d$, $f(w) = \frac{1}{2}\|w\|_2^2$,
and $g_t(w) = \max_{\bar{y} \in \mathcal{Y}} [\Delta(y_t, \bar{y}) - \langle w, \phi(x_t, y_t) - \phi(x_t, \bar{y}) \rangle]_+$, then the above
optimization problem is the max-margin structured prediction problem described
in the previous section. We can rewrite the optimization problem in Eq. 2.1 as
follows:

$$\inf_{w_0, w_1, \ldots, w_t} \left( (\sigma t) f(w_0) + \sum_{i=1}^{t} g_i(w_i) \right) \quad \text{s.t.} \;\; w_0 \in W,\ \forall i \in 1 \ldots t,\ w_i = w_0$$

where we introduce t new vectors $w_1, \ldots, w_t$ and constrain them to all be equal
to $w_0$. The dual of this problem is:

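$$\sup_{\lambda_1, \ldots, \lambda_t} D_{t+1}(\lambda_1, \ldots, \lambda_t) = \sup_{\lambda_1, \ldots, \lambda_t} \left[ -(\sigma t)\, f^*\!\left( -\frac{1}{\sigma t} \sum_{i=1}^{t} \lambda_i \right) - \sum_{i=1}^{t} g_i^*(\lambda_i) \right]$$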

where each $\lambda_i$ is a vector of Lagrange multipliers for the equality constraint
$w_i = w_0$, and $f^*, g_1^*, \ldots, g_t^*$ are the Fenchel conjugate functions of $f, g_1, \ldots, g_t$.
The Fenchel conjugate of a function $f : W \to \mathbb{R}$ is defined as $f^*(\theta) =
\sup_{w \in W} (\langle w, \theta \rangle - f(w))$. See Kakade and Shalev-Shwartz (2009) for details on
the steps to derive the dual problem.
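
As a quick worked example of the conjugate (a standard convex-analysis fact, stated here only for the unconstrained case $W = \mathbb{R}^d$ with the squared 2-norm complexity function from the example above), the supremum can be computed in closed form:

$$f(w) = \tfrac{1}{2}\|w\|_2^2 \;\;\Rightarrow\;\; f^*(\theta) = \sup_{w \in \mathbb{R}^d} \left( \langle w, \theta \rangle - \tfrac{1}{2}\|w\|_2^2 \right) = \tfrac{1}{2}\|\theta\|_2^2$$

The expression inside the supremum is concave in $w$ with gradient $\theta - w$, so it is maximized at $w = \theta$, which gives the value $\|\theta\|_2^2 - \tfrac{1}{2}\|\theta\|_2^2$. Consequently $\nabla f^*(\theta) = \theta$, so in this special case mapping a dual point back to a primal weight vector is the identity.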

From  the  weak  duality  theorem  (Boyd  &  Vandenberghe,  2004),  we  know 
that  the  dual  objective  is  upper  bounded  by  the  optimal  value  of  the  primal 
problem.  Thus,  if  an  online  algorithm  can  incrementally  ascend  the  dual 
objective  function  in  each  step,  then  its  performance  is  close  to  the  performance 
of  the  best  fixed  weight  vector  that  minimizes  the  primal  objective  function 
(the  best  offline  learner),  since  by  increasing  the  dual  objective,  the  algorithm 
moves  closer  to  the  optimal  primal  value. 

Based on this observation, Kakade and Shalev-Shwartz (2009) proposed
the general online incremental dual ascent algorithm (Algorithm 2.2),
where $\partial g_t(w_t) = \{\lambda : \forall w \in W,\ g_t(w) - g_t(w_t) \ge \langle \lambda, (w - w_t) \rangle\}$ is the
set of subgradients of $g_t$ at $w_t$. The condition $\exists \lambda' \in \partial g_t(w_t)$ s.t.
$D_{t+1}(\lambda_1^{t+1}, \ldots, \lambda_t^{t+1}) \ge D_{t+1}(\lambda_1^t, \ldots, \lambda_{t-1}^t, \lambda')$
ensures the dual objective is increased in each step. The

Algorithm 2.2 A general incremental dual ascent algorithm for $\sigma$-strongly convex
loss functions (Kakade & Shalev-Shwartz, 2009)

Input: A strongly convex function $f$, a positive scalar $\sigma$
for t = 1 to T do
  Set: $w_t = \nabla f^*\!\left( -\frac{1}{\sigma t} \sum_{i=1}^{t-1} \lambda_i^t \right)$
  Receive: $l_t(w_t) = \sigma f(w_t) + g_t(w_t)$
  Choose $(\lambda_1^{t+1}, \ldots, \lambda_t^{t+1})$ that satisfy the condition:
    $\exists \lambda' \in \partial g_t(w_t)$ s.t. $D_{t+1}(\lambda_1^{t+1}, \ldots, \lambda_t^{t+1}) \ge D_{t+1}(\lambda_1^t, \ldots, \lambda_{t-1}^t, \lambda')$
end for

regret of any algorithm derived from Algorithm 2.2 is $O(\log T)$ (Kakade &
Shalev-Shwartz, 2009), where T is the number of examples seen so far.

A simple update rule that satisfies the condition in Algorithm 2.2 is
to find a subgradient $\lambda' \in \partial g_t(w_t)$, set $\lambda_t^{t+1} = \lambda'$, and keep all other $\lambda_i$'s
unchanged (i.e., $\lambda_i^{t+1} = \lambda_i^t,\ \forall i < t$). However, the gain in the dual objective
for this simple update rule is minimal. To achieve the largest gain in the dual
objective, one can optimize all the $\lambda_i$'s at each step. But this approach is
usually computationally prohibitive, since at each step we need to solve
a large optimization problem:

$$(\lambda_1^{t+1}, \ldots, \lambda_t^{t+1}) \in \arg\max_{\lambda_1, \ldots, \lambda_t} D_{t+1}(\lambda_1, \ldots, \lambda_t)$$

A compromise approach is to fully optimize the dual objective function at each
time step t but only with respect to the last variable $\lambda_t$:

$$\lambda_i^{t+1} = \begin{cases} \lambda_i^t & \text{if } i < t \\ \arg\max_{\lambda_t} D_{t+1}(\lambda_1^t, \ldots, \lambda_{t-1}^t, \lambda_t) & \text{if } i = t \end{cases}$$

This is called the Coordinate-Dual-Ascent (CDA) update rule. If we can find
a closed-form solution of the optimization problem with respect to the last
variable $\lambda_t$, then the computational complexity of the CDA update is similar
to that of the simple update, but the gain in the dual objective function is larger.
Previous work (Shalev-Shwartz & Singer, 2007b) showed that algorithms which
more aggressively ascend the dual function have better performance.


2.6  Evaluation  Metrics 

In  this  section,  we  briefly  review  some  standard  metrics  for  evaluating 
the predictions produced by a model. For MLNs, all the query literals are
binary (i.e., either true or false), so there are only four possible outcomes,
which are shown in Table 2.1:


Table 2.1: Confusion matrix

                    Actual True            Actual False
Predicted True      True Positive (TP)     False Positive (FP)
Predicted False     False Negative (FN)    True Negative (TN)

Below  are  definitions  of  some  standard  evaluation  metrics: 


Accuracy: the proportion of correct predictions.

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

Precision: the proportion of true predictions that are correct.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: the proportion of true literals that are correctly predicted.

$$\text{Recall} = \frac{TP}{TP + FN}$$

$F_1$: the harmonic mean of precision and recall.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
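
The four metrics above can be computed directly from the confusion-matrix counts. The short Python sketch below is only an illustration; the function name `confusion_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumption: a metric whose denominator is zero is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts, used only to exercise the formulas.
print(confusion_metrics(tp=8, fp=2, tn=85, fn=5))
```

With many true negatives, as in the made-up counts here, accuracy can be high even when recall is modest, which is why precision, recall and $F_1$ are reported alongside it.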




Chapter  3 


Discriminative  Structure  and  Weight  Learning 
for  MLNs  with  Non-recursive  Clauses 

3.1  Introduction 

In this chapter, we look at a special class of MLNs where all the clauses
are non-recursive clauses which contain only one non-evidence literal.
Non-recursive clauses arise in many learning problems in ILP such as the
structure activity relationship prediction (SAR) problem mentioned in Chapter 1.
For those problems, there is a specific target predicate that must be inferred
given evidence data about other background predicates used to describe the
input data. We have found that existing structure learning algorithms for
MLNs (Kok & Domingos, 2005; Mihalkova & Mooney, 2007) perform very
poorly when tested on several benchmark ILP problems since they are
non-discriminative.

Thus, we present a new method that discriminatively learns both the
structure and parameters of MLNs with non-recursive clauses. The proposed
approach first uses an off-the-shelf ILP system to generate a large set of good
candidate clauses, then utilizes $l_1$-regularization with exact inference to learn
weights for those candidate clauses.

The remainder of the chapter is organized as follows. Section 3.2
presents the proposed method. Section 3.3 reports the experimental evaluation.
Section 3.4 discusses related work and Section 3.5 summarizes the chapter.

3.2  The  Proposed  Method 

3.2.1  Discriminative  Structure  Learning 

Ideally,  the  search  for  discriminative  MLN  clauses  would  be  directly 
guided  by  the  goal  of  maximizing  their  contribution  to  the  predictive  accuracy 
of  a  complete  MLN.  However,  this  would  require  evaluating  every  proposed 
refinement  to  the  existing  set  of  learned  clauses  by  relearning  weights  for  all  of 
the  clauses  and  performing  full  probabilistic  inference  to  determine  the  score  of 
the  revised  model.  This  process  is  computationally  expensive  and  would  have 
to  be  repeated  for  each  of  the  combinatorially  large  number  of  potential  clause 
refinements.  Evaluating  clauses  in  standard  ILP  is  quicker  since  each  clause 
can  be  evaluated  in  isolation  based  on  the  accuracy  of  its  logical  inferences 
about  the  target  predicate.  Consequently,  we  take  the  heuristic  approach  of 
using  a  standard  ILP  method  to  generate  clauses;  however,  since  the  logical 
accuracy  of  a  clause  is  only  a  rough  approximation  of  its  value  in  a  final 
MLN,  we  generate  a  large  number  of  candidates  whose  accuracy  is  at  least 
markedly  greater  than  random  guessing  and  allow  subsequent  weight  learning 
to  determine  their  value  to  an  overall  MLN. 

In order to find a set of potentially good clauses for an MLN, we use
a particular configuration of Aleph. Specifically, we use the induce_cover
command and the m-estimate evaluation function. The induce_cover command
implements a variant of Progol's MDIE greedy covering algorithm (Muggleton,
1995) which does not remove previously covered examples when scoring
a new clause. The normal Aleph induce command scores a clause based
only on its coverage of currently uncovered positive examples. However, this
scoring is not reflective of its use in a final MLN, and we found that the
induce_cover approach produces a larger set of more useful clauses that
significantly increases the accuracy of our final learned MLN. The m-estimate
(Dzeroski, 1991) is a Bayesian estimation of the accuracy of a clause (Cussens,
2007). The m parameter defining the underlying prior distribution is automatically
set to the maximum likelihood estimate of its best value. The output of
induce_cover is a theory, a set of high-scoring clauses that cover all the positive
examples. However, these clauses were selected based on an m-estimate
of their accuracy under a purely logical interpretation, and may not be the
best ones for an MLN. Therefore, in addition to these clauses, we also save all
generated clauses whose m-estimate is greater than a predefined threshold (set
to 0.6 in our experiments). This provides a large set of clauses of potential
utility for an MLN. We use the name Aleph++ to refer to this version of
Aleph.
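
To make the thresholding step concrete, here is a small Python sketch using the textbook form of the m-estimate, (p + m * prior) / (p + n + m), where p and n are the numbers of positive and negative examples covered by a clause. This is an assumption for illustration: Aleph's actual implementation, and in particular how its automatic setting chooses m, may differ. The function names, the clause records, and the value m = 2 are all hypothetical.

```python
def m_estimate(pos_covered, neg_covered, prior_pos, m):
    """Textbook m-estimate of clause accuracy (assumed form, not Aleph's code)."""
    return (pos_covered + m * prior_pos) / (pos_covered + neg_covered + m)


def keep_candidates(clauses, prior_pos, m, threshold=0.6):
    """Keep every clause whose m-estimate is strictly greater than the threshold."""
    return [name for name, p, n in clauses
            if m_estimate(p, n, prior_pos, m) > threshold]


# Hypothetical clause records: (name, positives covered, negatives covered).
candidates = [("c1", 8, 2), ("c2", 3, 3), ("c3", 20, 9)]
print(keep_candidates(candidates, prior_pos=0.5, m=2.0))
```

The prior term pulls the estimate of a clause with little coverage toward the class prior, so a clause covering very few examples needs a cleaner split to pass the same threshold.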

3.2.2  Discriminative  Weight  Learning 

Compared to Alchemy's previously best discriminative weight learning
method (Lowd & Domingos, 2007), our method embodies two important
modifications: exact inference and $l_1$-regularization. This section describes
these two modifications.


First,  given  the  restricted  nature  of  the  clauses  constructed  by  Aleph, 
we  can  use  an  efficient  exact  probabilistic  inference  method  when  learning  the 
weights  instead  of  the  approximate  inference  algorithm  that  is  used  to  handle 
the  general  case.  Since  these  clauses  are  non-recursive  clauses  in  which  the 
target  predicate  only  appears  once,  a  grounding  of  any  clause  will  contain 
only  one  grounding  of  the  target  predicate.  For  MLNs,  this  means  that  the 
Markov  blanket  of  a  query  atom  only  contains  evidence  atoms.  Consequently, 
the query atoms are independent given the evidence. Let Y be the set of query
atoms and X be the set of evidence atoms. The conditional log-likelihood of Y
given X in this case is:

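$$\log P(Y = y | X = x) = \log \prod_{j=1}^{n} P(Y_j = y_j | X = x) = \sum_{j=1}^{n} \log P(Y_j = y_j | X = x)$$

and,

$$P(Y_j = y_j | X = x) = \frac{\exp\left(\sum_{i \in C_{Y_j}} w_i n_i(x, y_{[Y_j = y_j]})\right)}{\exp\left(\sum_{i \in C_{Y_j}} w_i n_i(x, y_{[Y_j = 0]})\right) + \exp\left(\sum_{i \in C_{Y_j}} w_i n_i(x, y_{[Y_j = 1]})\right)}$$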

where $C_{Y_j}$ is the set of all MLN clauses with at least one grounding containing
the query atom $Y_j$, $n_i(x, y_{[Y_j = y_j]})$ is the number of groundings of the $i$-th clause
that evaluate to true when all the evidence atoms in X and the query atom $Y_j$
are set to their truth values, and similarly for $n_i(x, y_{[Y_j = 0]})$ and $n_i(x, y_{[Y_j = 1]})$
when $Y_j$ is set to 0 and 1 respectively. Then the gradient of the CLL is:


$$\frac{\partial}{\partial w_i} \log P(Y = y | X = x) = \sum_{j=1}^{n} \left[ n_i(x, y_{[Y_j = y_j]}) - P(Y_j = 0 | X = x)\, n_i(x, y_{[Y_j = 0]}) - P(Y_j = 1 | X = x)\, n_i(x, y_{[Y_j = 1]}) \right]$$

Notice that the sum of the last two terms in the gradient is the expected
count of the number of true groundings of the i'th formula. In general,
computing this expected count requires performing approximate inference under
the model. For example, Singla and Domingos (2005) ran MPE inference and
used the counts in the MPE state to approximate the expected counts. However,
in our case, using the standard closed world assumption for evidence
predicates, all the $n_i$'s can be computed without approximate inference since
there is no ground atom whose truth value is unknown. This is a result of
restricting the structure learner to non-recursive clauses. In fact, this result
still holds even when the clauses are not Horn clauses. The only restriction is
that the target predicate appears only once in every clause. Note that given
a set of weights, computing the conditional probability $P(y|x)$, the CLL, and
its gradient requires only the $n_i$ counts. So, in our case, the conditional
probability $P(Y_j = y_j | X = x)$, the CLL, and its gradient can be computed exactly.
In addition, these counts only need to be computed once, and Alchemy
provides an efficient method for computing them. Alchemy also provides an
efficient way to construct the Markov blanket of a query atom; in particular, it
ignores all ground formulae whose truth values are unaffected by the value of
the query atom. In our case, this helps reduce the size of the Markov blanket
of a query atom significantly since many ground clauses are satisfied by the
evidence. As a result, our exact inference is very fast even when the MLN
contains thousands of clauses.
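
The exact computation described above can be sketched in a few lines of Python. The sketch assumes the true-grounding counts for each query atom have already been computed, with one count vector for the atom set to false and one for it set to true; the data layout and the function names are hypothetical and are not Alchemy's interface.

```python
import math


def prob_true(weights, counts_false, counts_true):
    """Exact P(Y_j = 1 | x) from the clause counts under Y_j = 0 and Y_j = 1."""
    s0 = sum(w * n for w, n in zip(weights, counts_false))
    s1 = sum(w * n for w, n in zip(weights, counts_true))
    z = s1 - s0
    # Numerically stable logistic function of the score difference.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cll_and_gradient(weights, atoms):
    """CLL and its gradient; atoms is a list of (y_j, counts_false, counts_true)."""
    cll = 0.0
    grad = [0.0] * len(weights)
    for y, n0, n1 in atoms:
        p1 = prob_true(weights, n0, n1)
        cll += math.log(p1 if y == 1 else 1.0 - p1)
        n_obs = n1 if y == 1 else n0
        for i in range(len(weights)):
            # Observed count minus the expected count under the model.
            grad[i] += n_obs[i] - (1.0 - p1) * n0[i] - p1 * n1[i]
    return cll, grad


# Hypothetical example: one clause, one query atom observed to be true.
print(cll_and_gradient([math.log(3.0)], [(1, [0.0], [1.0])]))
```

Because each query atom only needs its own two count vectors, the cost is linear in the number of query atoms times the number of clauses, and no sampling is involved.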

Given a procedure for computing the CLL and its gradient, standard
gradient-based optimization methods can be used to find a set of weights
that optimizes the CLL. However, to prevent overfitting and select only the
best clauses, we follow the approach suggested by Lee, Ganapathi, and Koller
(2007) and introduce a Laplacian prior with zero mean, $P(w_i) = (\beta/2) \cdot
\exp(-\beta |w_i|)$, on each weight, and then optimize the posterior conditional log
likelihood instead of the CLL. The final objective function is:

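$$\begin{aligned}
\log P(Y|X) P(w) &= \log P(Y|X) + \log P(w) \\
&= \log P(Y|X) + \log\Big(\prod_i P(w_i)\Big) \\
&= \text{CLL} + \log\Big(\prod_i \frac{\beta}{2} \exp(-\beta |w_i|)\Big) \\
&= \text{CLL} - \beta \sum_i |w_i| + \text{constant}
\end{aligned}$$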

There is now an additional term $\beta \sum_i |w_i|$ in the objective function, which
penalizes each non-zero weight $w_i$ by $\beta |w_i|$. So, the larger $\beta$ is (corresponding
to a smaller variance of the prior distribution), the more we penalize non-zero
weights. Therefore, placing a Laplacian prior with zero mean on each weight
is equivalent to performing an $l_1$-regularization of the parameters. An important
property of $l_1$-regularization is its tendency to force parameters to zero
by strongly penalizing small terms (Lee et al., 2007). In order to learn weights
that optimize the $l_1$-regularized CLL, we use the OWL-QN package which
implements the Orthant-Wise Limited-memory Quasi-Newton algorithm (Andrew
& Gao, 2007).
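
To illustrate why an $l_1$ penalty drives weights exactly to zero, the sketch below performs one proximal-gradient (soft-thresholding) ascent step on the penalized objective CLL - beta * sum_i |w_i|. This is deliberately a different and much simpler technique than OWL-QN, and it is not the optimizer used in this work; the function names and the numbers are hypothetical.

```python
def soft_threshold(value, tau):
    """Shrink value toward zero by tau, returning exactly 0.0 inside [-tau, tau]."""
    if value > tau:
        return value - tau
    if value < -tau:
        return value + tau
    return 0.0


def l1_proximal_ascent_step(weights, cll_grad, beta, step):
    """One step: gradient ascent on the CLL, then the proximal map of the l1 penalty."""
    return [soft_threshold(w + step * g, step * beta)
            for w, g in zip(weights, cll_grad)]


# Made-up weights and CLL gradient for three candidate clauses.
print(l1_proximal_ascent_step([0.5, -0.02, 0.0], [1.0, 0.1, -0.3],
                              beta=1.0, step=0.1))
```

Weights whose gradient-updated magnitude does not exceed step * beta are set exactly to zero, which is the clause-elimination effect described above; a squared penalty would only shrink them.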

This approach to preventing over-fitting contrasts with the standard
$l_2$-regularization used in previous work on learning weights for MLNs, which
is equivalent to assuming a Gaussian prior with zero mean on each weight and
does not penalize non-zero weights as severely. Since Aleph++ generates a
very large number of potential clauses, $l_1$-regularization encourages eliminating
the less useful ones by setting their weights to zero. In agreement with
prior results on $l_1$-regularization (Ng, 2004; Dudík, Phillips, & Schapire, 2007),
our experiments confirm that it results in simpler and more accurate learned
models compared to $l_2$-regularization.

3.3  Experimental  Evaluation 

In  this  section,  we  present  experiments  that  were  designed  to  answer 
the  following  questions: 

1.  How  does  our  method  compare  to  existing  methods,  specifically: 

(a)  Extant  learning  methods  for  MLNs. 

(b)  Traditional  ILP  methods,  viz.  Aleph. 

(c) “Advanced” ILP methods, viz. kFOIL (Landwehr, Passerini, Raedt,
& Frasconi, 2006), tFOIL (Landwehr, Kersting, & Raedt, 2007),
and Rumble (Ruckert & Kramer, 2007).

2. How does each of our system's major novel components below contribute
to its performance:

(a) Generation of a larger set of potential clauses by using Aleph++
instead of Aleph.

(b) Exact MLN inference for non-recursive clauses instead of general
approximate inference.

(c) $l_1$-regularization instead of $l_2$.

3.3.1  Data 

We  employed  four  benchmark  data  sets  previously  used  to  evaluate  a 
variety  of  ILP  and  relational  learning  algorithms.  They  concern  predicting 
the  relative  biochemical  activity  of  variants  of  Tacrine,  a  drug  for  Alzheimer’s 
disease  (King  et  al.,  1995).  The  data  contain  background  knowledge  about  the 
physical  and  chemical  properties  of  substituents  such  as  their  hydrophobicity 
and  polarity,  the  relations  between  various  physical  and  chemical  constants, 
and  other  relevant  information.  The  goal  is  to  compare  various  drugs  on  four 
important  biochemical  properties:  low  toxicity,  high  acetyl  cholinesterase 
inhibition,  good  reversal  of  scopolamine-induced  memory  impairment,  and 
inhibition  of  amine  re-uptake.  For  each  property,  the  positive  and  negative 

Table 3.1: Some background evidence and examples from the Alzheimer toxic
dataset.

Background evidence                                      Examples
r_subst_1(A1,H), r_subst_1(B1,H), r_subst_1(D1,H),       less_toxic(B1,A1)
x_subst(B1,7,CL), polar(CL,POLAR3), size(CL,SIZE1),      less_toxic(A1,D1)
alk_groups(A1,0), alk_groups(B1,0), alk_groups(D1,0)     less_toxic(B1,D1)

examples are pairwise comparisons of drugs. For example, less_toxic(d1,d2)
means that drug d1's toxicity is less than d2's. These ordering relations are
transitive  but  not  complete  (i.e.  for  some  pairs  of  drugs  it  is  unknown  which 
one  is  better).  Therefore,  this  is  a  structured  (a.k.a.  collective)  prediction 
problem  since  the  output  labels  should  form  a  partial  order.  However,  previous 
work  has  ignored  this  structure  and  just  predicted  the  examples  separately  as 
distinct  binary  classification  problems.  In  this  work,  in  addition  to  treating 
the  problem  as  independent  classification,  we  also  use  an  MLN  to  perform 
structured  prediction  by  explicitly  imposing  the  transitive  constraint  on  the 
target  predicate.  Table  3.1  shows  some  background  facts  and  examples  from 
one  of  the  datasets,  and  Table  3.2  summarizes  information  about  all  four 
datasets. 


Table 3.2: Summary statistics for Alzheimer's data sets.

Data set             # Examples   % Positive   # Predicates
Alzheimer acetyl     1326         50%          30
Alzheimer amine      686          50%          30
Alzheimer memory     642          50%          30
Alzheimer toxic      886          50%          30

3.3.2 Methodology

To answer the above questions, we ran experiments with the following
systems:

Alchemy:  Uses  the  structure  learning  algorithm  MSL  (Kok  &  Domingos, 
2005)  in  Alchemy  and  the  most  accurate  existing  discriminative  weight 
learning  PSCG  (Lowd  &  Domingos,  2007)  with  the  “ne”  (non-evidence) 
parameter  set  to  the  target  predicate. 

Busl:  Uses  Busl  (Mihalkova  &  Mooney,  2007)  and  PSCG  discriminative 
weight  learning  with  the  “ne”  (non-evidence)  parameter  set  to  the  target 
predicate. 

Aleph: Uses Aleph's standard settings with a few modifications. The maximum
number of literals in an acceptable clause was set to 5. The minimum
number of positive examples covered by an acceptable clause was
set to 2. The upper bound on the number of negative examples covered
by an acceptable clause was set to 300. The evaluation function
was set to auto_m, and the minimum score of an acceptable clause was
set to 0.6. The induce_cover command was used to learn the clauses.
We found that this configuration gave somewhat better overall accuracy
compared to those reported in previous work.

AlephPSCG: Uses the discriminative weight learner PSCG to learn MLN
weights for the clauses in the final theory returned by Aleph. Note that
PSCG also uses $l_2$-regularization.

AlephExactL2: Uses the limited-memory BFGS algorithm (Liu & Nocedal,
1989) implemented in Alchemy to learn discriminative MLN weights
for the clauses in the final theory returned by Aleph. The objective
function is CLL with $l_2$-regularization. The CLL is computed exactly as
described in Section 3.2.2.

Aleph++PSCG: Like AlephPSCG, but learns weights for the larger set of
clauses returned by Aleph++.

Aleph++ExactL2: Like AlephExactL2, but learns weights for the larger
set of clauses returned by Aleph++.

Aleph++ExactL1: Our full proposed approach using exact inference and
$l_1$-regularization to learn weights on the clauses returned by Aleph++.

To  force  the  predictions  for  the  target  predicate  to  properly  constitute 
a  partial  ordering,  we  also  tried  adding  to  the  learned  MLNs  a  hard  constraint 
(i.e.  a  clause  with  infinite  weight)  stating  the  transitive  property  of  the  target 
predicate,  and  used  the  MC-SAT  algorithm  to  perform  prediction  on  the  test 
data.  This  exploits  the  ability  of  MLNs  to  perform  collective  classification  for 
the  complete  set  of  test  examples. 
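
To see what the transitivity constraint rules out, the following Python sketch lists the violations in a set of independently predicted pairs. It is only a diagnostic check on thresholded predictions, written under the assumption that predictions are available as pairs of drug identifiers; it does not perform MC-SAT inference, and the function name is hypothetical.

```python
def transitivity_violations(true_pairs):
    """Return (a, b, c) with rel(a, b) and rel(b, c) predicted true but rel(a, c) not."""
    pairs = set(true_pairs)
    successors = {}
    for a, b in pairs:
        successors.setdefault(a, set()).add(b)
    violations = []
    for a, b in sorted(pairs):
        for c in sorted(successors.get(b, ())):
            if (a, c) not in pairs:
                violations.append((a, b, c))
    return violations


# Hypothetical predictions: b1 < a1 and a1 < d1 are predicted, b1 < d1 is not.
print(transitivity_violations([("b1", "a1"), ("a1", "d1")]))
```

A hard transitivity clause forces the joint prediction to contain no such triple, so some atom involved in each violation must change its value.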

In testing, only the background facts are provided as evidence to ensure
that all predictions are based on the chemical structure of a drug. For
all systems except Aleph, a threshold of 0.5 was used to convert predicted
probabilities into boolean values. The predictive accuracy of these algorithms

Table 3.3: Average predictive accuracies and standard deviations for all
systems. An asterisk marks the best result on a data set.

Data set           Alchemy      BUSL         Aleph        Aleph        Aleph        Aleph++      Aleph++      Aleph++
                                                          PSCG         ExactL2      PSCG         ExactL2      ExactL1
Alzheimer amine    50.1 ± 0.5   51.3 ± 2.5   81.6 ± 5.1   64.6 ± 4.6   83.5 ± 4.7   72.0 ± 5.2   86.8 ± 4.4   89.4 ± 2.7*
Alzheimer toxic    54.7 ± 7.4   51.7 ± 5.3   81.7 ± 4.2   74.7 ± 1.9   87.5 ± 4.8   69.9 ± 1.2   89.5 ± 3.0   91.3 ± 2.8*
Alzheimer acetyl   48.2 ± 2.9   55.9 ± 8.7   79.6 ± 2.2   78.0 ± 3.2   79.5 ± 2.0   76.5 ± 3.7   82.1 ± 2.1   85.1 ± 2.4*
Alzheimer memory   50.0 ± 0.0   49.8 ± 1.6   76.0 ± 4.9   60.3 ± 2.1   72.6 ± 3.4   65.6 ± 5.4   72.9 ± 5.2   77.6 ± 4.9*

Table 3.4: Average AUC-ROC and standard deviations for all systems. An
asterisk marks the best result on a data set.

Data set           Alchemy       BUSL          Aleph         Aleph         Aleph++       Aleph++       Aleph++
                                               PSCG          ExactL2       PSCG          ExactL2       ExactL1
Alzheimer amine    .483 ± .115   .641 ± .110   .846 ± .041   .904 ± .027   .777 ± .052   .935 ± .032   .954 ± .019*
Alzheimer toxic    .622 ± .079   .511 ± .079   .904 ± .034   .930 ± .035   .874 ± .041   .937 ± .029   .939 ± .035*
Alzheimer acetyl   .473 ± .037   .588 ± .108   .850 ± .018   .850 ± .020   .810 ± .040   .899 ± .015   .916 ± .013*
Alzheimer memory   .452 ± .088   .426 ± .065   .744 ± .040   .768 ± .032   .737 ± .059   .813 ± .059   .844 ± .052*

for the target predicate was compared using 10-fold cross-validation. The
significance of the results was evaluated using a two-tailed paired t-test
with a 95% confidence level. To compare the quality of the predicted
probabilities, we also report the average area under the ROC curve (AUC-ROC)
(Provost, Fawcett, & Kohavi, 1998) for all probabilistic systems by using the
AUCCalculator package (Davis & Goadrich, 2006).
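
For readers who want to reproduce the two kinds of scores on their own predictions, the Python sketch below thresholds probabilities at 0.5 for accuracy and computes AUC-ROC through its rank interpretation (the probability that a random positive example is scored above a random negative one, with ties counted as one half). It is an illustrative stand-in, not the AUCCalculator package, and the function names and data are hypothetical.

```python
def accuracy_at_threshold(labels, probs, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the 0/1 label.

    Assumption: a probability exactly equal to the threshold is predicted true.
    """
    predictions = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(predictions, labels)) / len(labels)


def auc_roc(labels, scores):
    """AUC-ROC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
labels = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.5, 0.1]
print(accuracy_at_threshold(labels, probs), auc_roc(labels, probs))
```

The pairwise form is quadratic in the number of examples, which is acceptable for data sets of the size used here; a sort-based computation would be preferable for much larger ones.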

3.3.3  Results  and  Discussion 

Tables 3.3 and 3.4 show the average accuracy and AUC-ROC with
standard deviation for each system running on each data set. Our complete
system (Aleph++ExactL1) achieves significantly higher accuracy than both
Alchemy and Busl on all 4 data sets and significantly higher than Aleph

Table 3.5: Average predictive accuracies and standard deviations for MLN
systems with transitive clause added.

Data set           Alchemy      BUSL         Aleph        Aleph        Aleph++      Aleph++      Aleph++
                                             PSCG         ExactL2      PSCG         ExactL2      ExactL1
Alzheimer amine    50.0 ± 0.0   52.2 ± 5.3   61.4 ± 3.6   87.0 ± 3.3   72.9 ± 3.5   91.7 ± 3.5   90.5 ± 3.6
Alzheimer toxic    50.0 ± 0.0   50.1 ± 0.8   73.3 ± 1.8   88.8 ± 4.8   68.4 ± 1.5   91.4 ± 3.6   91.9 ± 4.1
Alzheimer acetyl   53.0 ± 6.2   54.1 ± 4.9   80.4 ± 2.7   84.1 ± 3.1   83.3 ± 2.5   88.7 ± 2.1   87.6 ± 2.7
Alzheimer memory   50.0 ± 0.0   50.1 ± 0.5   58.9 ± 2.3   76.5 ± 3.5   70.1 ± 5.2   81.3 ± 4.8   81.3 ± 4.1

Table 3.6: Average number of clauses learned

Data set           Aleph++   Aleph++ExactL2   Aleph++ExactL1
Alzheimer amine    7061      5070             3477
Alzheimer toxic    2034      1194             747
Alzheimer acetyl   8662      5427             2433
Alzheimer memory   6524      4250             2471

on all except the memory data set, answering questions 1(a) and 1(b). In
turn, Aleph has been shown to give higher accuracy on these data sets than
other standard ILP systems like Foil (Landwehr et al., 2007). Both MSL and
BUSL find only a few (3-5) simple clauses. Two of them are unit clauses for
the target predicate, such as great_ne(a1,a1) and great_ne(a1,a2); the others
capture the transitive nature of the target relation. Therefore, even after they
are discriminatively weighted, their predictions are not significantly better
than random guessing.

The ablations that remove components from our overall system demonstrate
the important contribution of each component. Regarding question
2(b), the systems using general approximate inference (AlephPSCG and

Table 3.7: Average predictive accuracies and standard deviations of our best
results and other “advanced” ILP systems.

Data set           Our best results   tFOIL        kFOIL        Rumble
Alzheimer amine    91.7 ± 3.5         87.5 ± 4.4   88.8 ± 5.0   91.1
Alzheimer toxic    91.9 ± 4.1         92.1 ± 2.6   89.3 ± 3.5   91.2
Alzheimer acetyl   88.7 ± 2.1         82.8 ± 3.8   87.8 ± 4.2   88.4
Alzheimer memory   81.3 ± 4.1         80.4 ± 5.3   80.2 ± 4.0   83.2

Aleph++PSCG) perform much worse than the corresponding versions that
use exact inference (AlephExactL2 and Aleph++ExactL2). Therefore,
when there is a target predicate that can be accurately inferred using
non-recursive clauses, exploiting this restriction to perform exact inference is a
clear win.

Regarding question 2(a), Aleph++ExactL2 performs significantly
better than AlephExactL2, demonstrating the advantage of learning a large
set of potential clauses and combining them with learned weights in an overall
MLN. Across the four datasets, Aleph++ returns an average of 6,070 clauses
compared to only 10 for Aleph.

Table 3.5 presents average accuracies with standard deviations for the
MLN systems when we include a transitivity clause for the target predicate.
This constraint improves the accuracies of AlephExactL2, Aleph++ExactL2,
and Aleph++ExactL1, but sometimes decreases the accuracy of other systems,
such as AlephPSCG. This can be explained as follows. Since most
of the predictions of Aleph++ExactL1 are correct, enforcing transitivity
can correct some of the wrong ones. However, AlephPSCG produces many
wrong predictions, so forcing them to obey transitivity can produce additional
incorrect predictions.

Regarding question 2(c), using $\ell_1$-regularization gives significantly
higher accuracy and AUC-ROC than using standard $\ell_2$-regularization.
This comparison was only performed for Aleph++ since this is when the
weight learner must choose from a large set of candidate clauses by
encouraging zero weights. Table 3.6 compares the average number of clauses
learned (after zero-weight clauses are removed) for $\ell_1$ and $\ell_2$
regularization. As expected, the final learned MLNs are much simpler when
using $\ell_1$-regularization. On average, $\ell_1$-regularization reduces
the size of the final set of clauses by 26% compared to
$\ell_2$-regularization.
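Why $\ell_1$-regularization produces exactly-zero weights while
$\ell_2$-regularization only shrinks them can be seen in a one-dimensional
sketch. The example below is not the weight learner used in this chapter: it
assumes a hypothetical quadratic loss 0.5*(w - a)^2 around an unregularized
optimum a, for which both regularized minimizers have well-known closed forms.
The names `l1_minimizer`, `l2_minimizer`, `unregularized`, and `lam` are
illustrative only.

```python
def l1_minimizer(a: float, lam: float) -> float:
    """argmin_w 0.5*(w - a)**2 + lam*abs(w), i.e. soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # weakly supported weights are driven exactly to zero


def l2_minimizer(a: float, lam: float) -> float:
    """argmin_w 0.5*(w - a)**2 + 0.5*lam*w**2, i.e. proportional shrinkage."""
    return a / (1.0 + lam)


# Unregularized optima for five hypothetical clause weights.
unregularized = [2.0, 0.3, -0.1, -1.5, 0.05]
lam = 0.5

l1_weights = [l1_minimizer(a, lam) for a in unregularized]
l2_weights = [l2_minimizer(a, lam) for a in unregularized]

# Number of clauses that survive once zero-weight clauses are removed.
l1_kept = sum(1 for w in l1_weights if w != 0.0)
l2_kept = sum(1 for w in l2_weights if w != 0.0)
print(l1_kept, l2_kept)
```

In this toy setting the $\ell_1$ penalty keeps only the two strongly supported
weights, whereas the $\ell_2$ penalty keeps all five at reduced magnitude.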

Regarding question 1(c), several researchers have tested “advanced”
ILP systems on our datasets. Table 3.7 compares our best results to those
reported for tFOIL (a combination of FOIL and tree augmented naive Bayes),
kFOIL (a kernelized version of FOIL), and Rumble (a max-margin approach
to learning a weighted rule set). Our results are competitive with these
recent systems. Additionally, unlike MLNs, these methods do not create
“declarative” theories that have a well-defined possible worlds semantics.

3.4  Related  Work 

Using an off-the-shelf ILP system to learn clauses for MLNs is not a new
idea. Richardson and Domingos (2006) used Claudien, a non-discriminative
ILP system that can learn arbitrary first-order clauses, to learn MLN
structure and to refine the clauses from a knowledge base. Kok and Domingos
(2005) reported experimental results comparing their MLN structure learner
to learning clauses using Claudien, FOIL, and Aleph. However, since this
previous work used the relatively small set of clauses produced by these
unaltered ILP systems, the performance was not very good. ILP systems have
also been used to learn structures for other SRL models. The Sayu system
(Davis, Burnside, de Castro Dutra, Page, & Costa, 2005) used Aleph to
propose candidate features for a Bayesian network classifier. Muggleton
(2000) used Progol, another popular ILP system, to learn clauses for
Stochastic Logic Programs (SLPs).

When restricted to learning non-recursive clauses for classification, our
approach is equivalent to using Aleph to construct features for use by
$\ell_1$-regularized logistic regression. Under this view, our approach is
closely related to Maccent (Dehaspe, 1997), which uses a greedy approach to
induce clausal constraints that are used as features for maximum-entropy
classification. One difference between our approach and Maccent is that we
use a two-step process instead of greedily adding one feature at a time. In
addition, our clauses are induced in a bottom-up manner while Maccent uses
top-down search; and our weight learner employs $\ell_1$-regularization,
which makes it less prone to overfitting. Unfortunately, we could not
compare experimentally to Maccent since “only an implementation of a
propositional version of Maccent is available, which only handles data in
attribute-value (vector) format” (Landwehr et al., 2007). Additionally, MLNs
are a more expressive formalism that also allows for structured prediction,
as demonstrated by our results that include a transitivity constraint on the
target relation.

3.5  Chapter  Summary 

We have found that existing methods for learning Markov Logic Networks
perform very poorly when tested on several benchmark ILP problems in
drug design. We presented a new approach that discriminatively learns both
the structure and parameters of an MLN with non-recursive clauses. The
proposed approach uses a variant of an existing ILP system (Aleph) to
construct a large number of potential clauses and then effectively learns
their parameters by altering existing discriminative MLN weight-learning
methods to utilize exact inference and $\ell_1$ regularization. Experimental
results show that the resulting system outperforms existing MLN and ILP
methods and gives state-of-the-art results for the Alzheimer’s-drug
benchmarks.
Chapter 4


Max-Margin  Weight  Learning  for  MLNs 

4.1  Introduction 

In the previous chapter, we aimed to learn a model that maximizes the
CLL of the data. If the goal is to predict accurate target-predicate
probabilities, that approach is well motivated. However, in many
applications, the actual goal is to maximize an alternative performance
metric such as classification accuracy or F-measure. Max-margin training
provides a framework for maximizing a variety of performance metrics
(Joachims, 2005). In this chapter, we present a max-margin approach to
weight learning in MLNs based on the general framework of max-margin
training for structured prediction (section 2.4).

The remainder of the chapter is organized as follows. Section 4.2
formulates the max-margin weight learning problem. Section 4.3 discusses
approximate inference for MLNs based on Linear Programming (LP) relaxation.
Section 4.4 reports experimental evaluation. Section 4.5 discusses related
work and section 4.6 summarizes the chapter.
4.2 Max-Margin Formulation

All of the current discriminative weight learners for MLNs try to find
a weight vector w that optimizes the conditional log-likelihood P(y|x) of the
query atoms y given the evidence x. However, an alternative approach is to
learn a weight vector w that maximizes the ratio:

$$\frac{P(y \mid x)}{P(\hat{y} \mid x)}$$

between the probability of the correct truth assignment y and the closest
competing incorrect truth assignment
$\hat{y} = \arg\max_{\bar{y} \in Y \setminus y} P(\bar{y} \mid x)$. Applying
equation 2.1 and taking the log, this problem translates to maximizing the
margin:

$$\gamma(x, y; w) = w^T n(x, y) - w^T n(x, \hat{y})
                  = w^T n(x, y) - \max_{\bar{y} \in Y \setminus y} w^T n(x, \bar{y})$$

Note that this translation holds for all log-linear models (Collins, 2004).
For example, if we apply it to a CRF (Lafferty, McCallum, & Pereira, 2001)
then the resulting model is an M3N (Taskar et al., 2004). Similarly, when
changing the objective of MLNs to maximize the margin, we create a max-margin
version of MLNs, abbreviated as M3LNs.
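The step from the probability ratio to the margin can be spelled out as
follows. This is a sketch that assumes equation 2.1 has the usual log-linear
MLN form $P(y \mid x) = \frac{1}{Z_x} \exp(w^T n(x, y))$; the symbol $Z_x$ for
the normalization constant is notation introduced here for illustration.

```latex
\begin{align*}
\log \frac{P(y \mid x)}{P(\hat{y} \mid x)}
  &= \log \frac{\frac{1}{Z_x} \exp\big(w^T n(x, y)\big)}
               {\frac{1}{Z_x} \exp\big(w^T n(x, \hat{y})\big)} \\
  &= w^T n(x, y) - w^T n(x, \hat{y}) \\
  &= \gamma(x, y; w)
\end{align*}
```

The normalization constant cancels because both assignments are conditioned on
the same evidence x, which is why the same argument applies to any log-linear
model.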

In turn, the max-margin problem above can be formulated as a “1-slack”
structural SVM as described in section 2.4:

Optimization Problem 4 (OP4): Max-Margin Markov Logic Networks

$$\min_{w,\ \xi \ge 0} \quad \frac{1}{2} w^T w + C\xi$$

$$\text{s.t.} \quad \forall \bar{y} \in Y : \ w^T [n(x, y) - n(x, \bar{y})] \ge \Delta(y, \bar{y}) - \xi$$

So for MLNs, the vector of the numbers of true groundings of the clauses,
n(x, y), plays the role of the feature vector function $\phi(x, y)$ in the
general structural SVM problem. In other words, each clause in an MLN can be
viewed as a feature representing a dependency between a subset of inputs and
outputs or a relation among several outputs.

As mentioned, in order to apply Algorithm 2.1 to MLNs, we need
algorithms for solving the following two problems:

Prediction: $\arg\max_{y \in Y} w^T n(x, y)$

Separation Oracle: $\arg\max_{\bar{y} \in Y} \{\Delta(y, \bar{y}) + w^T n(x, \bar{y})\}$

The prediction problem is just the (intractable) MPE inference problem
discussed in section 2.3. We can use MaxWalkSAT to get an approximate
solution, but we have found that models trained with MaxWalkSAT have very
low predictive accuracy. On the other hand, recent work (Finley & Joachims,
2008) has found that fully-connected pairwise Markov random fields, a special
class of structural SVMs, trained with overgenerating approximate inference
methods (such as relaxation) preserve the theoretical guarantees of
structural SVMs trained with exact inference and exhibit good empirical
performance. Based on this result, we sought a relaxation-based approximation
for MPE inference. We first present an LP-relaxation algorithm for MPE
inference, then show how to modify it to solve the separation oracle problem
for some specific loss functions.
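To make the two problems concrete, the sketch below solves both by exhaustive
enumeration on a tiny example. It is only an illustration under stated
assumptions: the feature function `n` counts true groundings of two
hypothetical clauses over a chain of three atoms, the weights are made up, the
loss is the Hamming loss, and brute force is feasible only because there are
2^3 assignments. It is not the LP-relaxation method developed in this chapter.

```python
from itertools import product


def n(y):
    """Toy feature vector: true-grounding counts of two hypothetical clauses."""
    unit = sum(y)  # clause A(i): one true grounding per true atom
    # clause A(i) => A(i+1): true unless A(i) is true and A(i+1) is false
    chain = sum(1 for a, b in zip(y, y[1:]) if (not a) or b)
    return (unit, chain)


def score(w, y):
    """w^T n(x, y) for the toy features."""
    return sum(wi * ni for wi, ni in zip(w, n(y)))


def hamming(y_true, y):
    return sum(1 for t, v in zip(y_true, y) if t != v)


def predict(w, n_atoms):
    """Prediction problem: argmax_y w^T n(x, y)."""
    return max(product((0, 1), repeat=n_atoms), key=lambda y: score(w, y))


def separation_oracle(w, y_true):
    """Most violated constraint: argmax_y Delta(y_true, y) + w^T n(x, y)."""
    return max(product((0, 1), repeat=len(y_true)),
               key=lambda y: hamming(y_true, y) + score(w, y))


w = (-0.5, 1.0)
y_true = (0, 0, 0)
y_pred = predict(w, len(y_true))
y_bar = separation_oracle(w, y_true)
# Amount by which the OP4 constraint for y_bar is violated when xi = 0.
violation = hamming(y_true, y_bar) - (score(w, y_true) - score(w, y_bar))
print(y_pred, y_bar, violation)
```

Here the prediction already equals the true assignment, yet the loss-augmented
search still returns a distant assignment whose margin is smaller than its
loss, so training would continue to adjust w.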

4.3  Approximate  Inference 

4.3.1  Approximate  MPE  inference  for  MLNs 

MPE inference in MLNs is a special case of MAP inference in Markov
networks with binary variables, and there has been a lot of work on
approximation algorithms for solving MAP inference using convex relaxation;
see Kumar, Kolmogorov, and Torr (2009) for more details. However, these
methods are not suitable for MLNs. First, most of them are for Markov
networks with unary and pairwise potential functions, while a ground MLN may
contain many high-order cliques. The algorithms can be extended to handle
high-order potential functions (Werner, 2008), but they become
computationally expensive. Second, they do not handle deterministic factors,
i.e. potential functions in which some entries are zero. On the other hand,
MPE inference in MLNs is equivalent to the Weighted MAX-SAT problem, and
there is also significant work on approximating this NP-hard problem using
LP-relaxation (Asano & Williamson, 2002; Asano, 2006). The existing
algorithms first relax and convert the Weighted MAX-SAT problem into a linear
or semidefinite programming problem, then solve it and apply a randomized
rounding method to obtain an approximate integral solution. These methods
cannot be directly applied to MLNs, since they require the weights to be
positive while MLN weights can be negative or infinite. However, we can
modify the conversion used in these approaches to handle the case of negative
and infinite weights.

Based  on  the  evidence  and  the  closed  world  assumption,  a  ground  MLN 
contains  only  ground  clauses  of  the  unknown  ground  atoms  after  removing  all 
trivially  satisfied  and  unsatisfied  clauses.  The  following  procedure  translates 
the  MPE  inference  in  a  ground  MLN  into  an  Integer  Linear  Programming 
problem. 


1. Assign a binary variable $y_i$ to each unknown ground atom. $y_i$ is 1 if
the corresponding ground atom is TRUE and 0 if the ground atom is
FALSE.

2. For each ground clause $C_j$ with infinite weight, add the following linear
constraint to the Integer Linear Programming problem:

$$\sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge 1$$

where $I_j^+$ and $I_j^-$ are the sets of positive and negative ground
literals in clause $C_j$ respectively.

3. For each ground clause $C_j$ with positive weight $w_j$, introduce a new
auxiliary binary variable $z_j$, add the term $w_j z_j$ to the objective
function, and add the following linear constraint to the Integer Linear
Programming problem:

$$\sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge z_j$$

$z_j$ is 1 if the corresponding ground clause is satisfied.

4. For each ground clause $C_j$ with k ground literals and negative weight
$w_j$, introduce a new auxiliary binary variable $z_j$, add the term
$-w_j z_j$ to the objective function, and add the following k linear
constraints to the Integer Linear Programming problem:

$$1 - y_i \ge z_j, \quad i \in I_j^+$$
$$y_i \ge z_j, \quad i \in I_j^-$$

Here $z_j$ can be 1 only if every literal of the clause is false, i.e. the
clause is unsatisfied.

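The final Integer Linear Programming problem has the following form:

Optimization Problem 5 (OP5):

$$\max_{y_i, z_j} \quad \sum_{C_j \in C^+} w_j z_j + \sum_{C_j \in C^-} -w_j z_j$$

$$\text{s.t.} \quad \sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge 1 \quad \forall C_j \text{ where } w_j = \infty$$

$$\sum_{i \in I_j^+} y_i + \sum_{i \in I_j^-} (1 - y_i) \ge z_j \quad \forall C_j \in C^+$$

$$1 - y_i \ge z_j \quad \forall i \in I_j^+ \text{ and } C_j \in C^-$$

$$y_i \ge z_j \quad \forall i \in I_j^- \text{ and } C_j \in C^-$$

$$y_i, z_j \in \{0, 1\}$$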

where $C^+$ and $C^-$ are the sets of clauses with positive and negative
weights respectively. This Integer Linear Programming problem can be
simplified by not introducing an auxiliary variable $z_j$ for unit clauses,
where we can use the variable $y_i$ directly. This reduces the problem
considerably, since ground MLNs typically contain many unit clauses (Alchemy
combines all the non-recursive clauses containing the query atom into a unit
clause whose weight is the sum of all the clauses’ weights). Note that our
mapping from a ground MLN to an Integer Linear Programming problem is a bit
different from the one presented by Riedel (2008), which generates two sets
of constraints for every ground clause: one when the clause is satisfied and
one when it is not. For a clause with positive weight, our mapping only
generates a constraint when the clause is satisfied; and for a clause with
negative weight, the mapping only imposes constraints when the clause is
unsatisfied. The final Integer Linear Programming problem has the same
solution as the one in (Riedel, 2008), but it has fewer constraints since our
mapping does not generate unnecessary constraints. We then relax the integer
constraints $y_i, z_j \in \{0, 1\}$ to linear constraints
$y_i, z_j \in [0, 1]$ to obtain an LP-relaxation of the MPE problem.
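The translation in steps 1 to 4 can be sanity-checked on a small instance. The
sketch below enumerates all binary assignments of a hypothetical three-atom
ground MLN, once scoring worlds directly (MPE) and once maximizing the OP5
objective under the OP5 constraints. The clause list, the tuple encoding
`(weight, positive_atoms, negative_atoms)`, and all function names are
assumptions made for illustration; no LP solver is involved, so this checks
the integer program rather than its relaxation.

```python
import math
from itertools import product

# Hypothetical toy ground MLN over three unknown atoms y0, y1, y2.
CLAUSES = [
    (2.0, [0], [1]),        # y0 v !y1, positive weight
    (-1.5, [1, 2], []),     # y1 v y2, negative weight
    (math.inf, [2], [0]),   # y2 v !y0, hard clause
    (1.0, [1], []),         # y1, positive weight
]
N_ATOMS = 3


def satisfied(pos, neg, y):
    return any(y[i] == 1 for i in pos) or any(y[i] == 0 for i in neg)


def mpe_brute_force(n_atoms, clauses):
    """Best total weight of satisfied soft clauses among worlds obeying hard clauses."""
    best = -math.inf
    for y in product((0, 1), repeat=n_atoms):
        if any(math.isinf(w) and not satisfied(p, q, y) for w, p, q in clauses):
            continue
        total = sum(w for w, p, q in clauses
                    if not math.isinf(w) and satisfied(p, q, y))
        best = max(best, total)
    return best


def literal_sum(pos, neg, y):
    """sum_{i in I+} y_i + sum_{i in I-} (1 - y_i)."""
    return sum(y[i] for i in pos) + sum(1 - y[i] for i in neg)


def ilp_brute_force(n_atoms, clauses):
    """Maximize the OP5 objective over binary y and z by enumeration."""
    soft = [j for j, (w, _, _) in enumerate(clauses) if not math.isinf(w)]
    best = -math.inf
    for y in product((0, 1), repeat=n_atoms):
        for z_values in product((0, 1), repeat=len(soft)):
            z = dict(zip(soft, z_values))
            feasible, objective = True, 0.0
            for j, (w, pos, neg) in enumerate(clauses):
                if math.isinf(w):    # step 2: hard clause
                    feasible &= literal_sum(pos, neg, y) >= 1
                elif w > 0:          # step 3: positive weight
                    feasible &= literal_sum(pos, neg, y) >= z[j]
                    objective += w * z[j]
                else:                # step 4: negative weight
                    feasible &= all(1 - y[i] >= z[j] for i in pos)
                    feasible &= all(y[i] >= z[j] for i in neg)
                    objective += -w * z[j]
            if feasible:
                best = max(best, objective)
    return best


# For a negative clause, w*[satisfied] = w + (-w)*[unsatisfied], so the two
# objectives differ by the constant sum of the negative weights.
offset = sum(w for w, _, _ in CLAUSES if w < 0)
mpe_value = mpe_brute_force(N_ATOMS, CLAUSES)
ilp_value = ilp_brute_force(N_ATOMS, CLAUSES)
print(mpe_value, ilp_value + offset)
```

The two values agree once the constant contributed by the negative-weight
clauses is added back, which is what one would expect if the mapping preserves
the MPE solution.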

This LP problem can be solved by any general LP solver. If the LP
solver returns an integral solution, then it is also the optimal solution to
the original Integer Linear Programming problem. In our case, the original
Integer Linear Programming problem is an NP-hard problem, so the LP solver
usually returns non-integral solutions. Therefore, the LP solution needs to
be rounded to give an approximate Integer Linear Programming solution. We
first tried some of the randomized rounding methods in (Asano, 2006) but they
gave poor results since the LP solution has a lot of fractional components
with value 0.5. We then adapted a rounding procedure called ROUNDUP (Boros
& Hammer, 2002), a procedure for producing an upper bound binary solution
for a pseudo-Boolean function, to the case of pseudo-Boolean functions with
linear constraints (Algorithm 4.1), which we found to work well. In each
step, this procedure picks one fractional component and rounds it to 1 or 0.
Hence, this process terminates in at most n steps, where n is the number of
query atoms. Note that due to the dependencies between the variables $y_i$
and $z_j$ (the linear constraints of the LP problem), this modified ROUNDUP
procedure does not guarantee an improvement in the value of the objective
function in each step, unlike the original ROUNDUP procedure where all the
variables are independent.

Algorithm 4.1 The modified ROUNDUP procedure
Input: The LP solution y = {y_1, ..., y_n}
F ← ∅
for i = 1 to n do
    if y_i is integral then
        Remove all the ground clauses satisfied by assigning the value of
        y_i to the corresponding ground atom
    else
        add y_i to F
    end if
end for
repeat
    Remove the last item y_i in F
    Compute the sum w+ of the weights of the unsatisfied clauses where y_i
    appears as a positive literal
    Compute the sum w- of the weights of the unsatisfied clauses where y_i
    appears as a negative literal
    if w+ > w- then
        y_i ← 1
    else
        y_i ← 0
    end if
    Remove all the ground clauses satisfied by assigning the value of y_i
    to the corresponding ground atom
until F is empty
return y
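The modified ROUNDUP procedure (Algorithm 4.1) can be sketched in Python as
follows. This is an illustrative reading rather than the implementation used
in the experiments: it assumes clauses are given as
`(weight, positive_atoms, negative_atoms)` tuples with finite weights,
interprets "the sum of the unsatisfied clauses" as the sum of their weights,
and treats a value as integral when it is within a small tolerance `tol` of 0
or 1. The example clauses and LP solution are made up.

```python
def modified_roundup(lp_solution, clauses, tol=1e-9):
    """Round a fractional LP solution to a 0/1 assignment (cf. Algorithm 4.1)."""
    y = list(lp_solution)
    remaining = list(clauses)  # ground clauses not yet satisfied

    def drop_satisfied(i, value):
        # Keep only the clauses that assigning atom i := value does not satisfy.
        nonlocal remaining
        remaining = [
            (w, pos, neg) for (w, pos, neg) in remaining
            if not ((value == 1 and i in pos) or (value == 0 and i in neg))
        ]

    fractional = []  # the list F of fractional components
    for i, v in enumerate(y):
        if abs(v - round(v)) <= tol:
            y[i] = int(round(v))
            drop_satisfied(i, y[i])
        else:
            fractional.append(i)

    while fractional:
        i = fractional.pop()  # remove the last item in F
        w_plus = sum(w for w, pos, _ in remaining if i in pos)
        w_minus = sum(w for w, _, neg in remaining if i in neg)
        y[i] = 1 if w_plus > w_minus else 0
        drop_satisfied(i, y[i])
    return y


# Hypothetical ground clauses over four atoms and a half-integral LP solution.
clauses = [
    (2.0, [0], [1]),     # y0 v !y1
    (1.0, [1, 2], []),   # y1 v y2
    (3.0, [], [2]),      # !y2
    (1.5, [2, 3], []),   # y2 v y3
]
rounded = modified_roundup([1.0, 0.5, 0.5, 0.0], clauses)
print(rounded)
```

Each fractional component is fixed once, so the loop runs at most n times, in
line with the termination argument above.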
4.3.2 Approximation algorithm for the separation oracle

The separation oracle adds an additional term, the loss term, to the
objective function. So, if we can represent the loss as a linear function of
the $y_i$ variables of the LP-relaxation, then we can use the above
approximation algorithm to also approximate the separation oracle. In this
work, we consider two loss functions. The first one is the 0/1 loss function,
$\Delta_{0/1}(y^T, y)$, where $y^T$ is the true assignment and y is some
predicted assignment. For this loss function, the separation oracle is the
same as the MPE inference problem since the loss function only adds a
constant 1 to the objective function. Hence, in this case, to find the most
violated constraint, we can use the LP-relaxation algorithm above or any
other MPE inference algorithm. This 0/1 loss makes the separation oracle
problem easier but it does not scale the margin by how different $y^T$ and y
are. It only requires a unit margin for all assignments y different from the
true assignment $y^T$. To address this problem, we consider a second loss
function, the number of misclassified atoms or the Hamming loss:

$$\Delta_{Hamming}(y^T, y) = \sum_{i}^{n} [y_i^T \ne y_i]
  = \sum_{i}^{n} [(y_i^T = 0 \wedge y_i = 1) \vee (y_i^T = 1 \wedge y_i = 0)]$$

From the definition, this loss can be represented as a function of the
$y_i$'s:

$$\Delta_{Hamming}(y^T, y) = \sum_{i: y_i^T = 0} y_i + \sum_{i: y_i^T = 1} (1 - y_i)$$
which is equivalent to adding 1 to the coefficient of $y_i$ if the true value
of $y_i$ is 0 and subtracting 1 from the coefficient of $y_i$ if the true
value of $y_i$ is 1. So we can use the LP-relaxation algorithm above to
approximate the separation oracle with this Hamming loss function. Another
possible loss function is the $F_1$ loss, which is equal to $1 - F_1$.
Unfortunately, this loss is a non-linear function, so we cannot use the above
approach to optimize it. Developing algorithms for optimizing or
approximating this loss function is an area for future work.
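The linear form of the Hamming loss can be checked directly. The sketch below
derives the per-atom coefficient adjustments (+1 where the true value is 0, -1
where it is 1) together with the constant term, and verifies them against the
definition on every binary assignment. The function names and the example
assignment `y_true` are illustrative assumptions.

```python
from itertools import product


def hamming_augmentation(y_true):
    """Return (coefficients, constant) such that
    Hamming(y_true, y) == constant + sum(c_i * y_i) for every binary y."""
    coefficients = [1 if t == 0 else -1 for t in y_true]
    constant = sum(1 for t in y_true if t == 1)
    return coefficients, constant


def hamming(y_true, y):
    """Number of misclassified atoms."""
    return sum(1 for t, v in zip(y_true, y) if t != v)


y_true = (1, 0, 0, 1)
coefficients, constant = hamming_augmentation(y_true)

# Compare the linear form with the definition on all 2^4 assignments.
linear_form_matches = all(
    hamming(y_true, y) == constant + sum(c * v for c, v in zip(coefficients, y))
    for y in product((0, 1), repeat=len(y_true))
)
print(coefficients, constant, linear_form_matches)
```

The constant does not affect which assignment maximizes the objective, so only
the coefficient adjustments need to be added to the LP objective.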

4.4  Experimental  Evaluation 

This  section  presents  experiments  comparing  the  max-margin  weight 
learner  to  the  weight  learners  in  section  3.2  and  the  PSCG  algorithm. 

4.4.1  Datasets 

Besides the Alzheimer’s datasets described in section 3.3.1, we also
ran experiments on two other large, real-world datasets: WebKB for collective
web-page classification, and CiteSeer for bibliographic citation segmentation.

The WebKB dataset, mentioned in chapter 1, consists of labeled web
pages from the computer science departments of four universities. Different
versions of this data have been used in previous work. To make a fair
comparison, we used the version from (Lowd & Domingos, 2007), which contains
4,165 web pages and 10,935 web links. Each page is labeled with a subset of
the categories: course, department, faculty, person, professor, research
project, and student. The goal is to predict these categories from the words
and links on the web pages. We used the same simple MLN from (Lowd &
Domingos, 2007), which only has clauses relating words to page classes, and
page classes to the classes of linked pages.

Has(+word, page) ⇒ PageClass(+class, page)
¬Has(+word, page) ⇒ PageClass(+class, page)
PageClass(+c1, p1) ∧ Linked(p1, p2) ⇒ PageClass(+c2, p2)

The  plus  notation  creates  a  separate  clause  for  each  pair  of  word  and  page  class, 
and  for  each  pair  of  classes.  The  final  MLN  consists  of  10,891  clauses,  and  a 
weight  must  be  learned  for  each  one.  After  grounding,  each  department  results 
in  an  MLN  with  more  than  100,000  ground  clauses  and  5,000  query  atoms  in  a 
complex  network.  This  also  results  in  a  large  LP-relaxation  problem  for  MPE 
inference. 
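The effect of the plus notation can be illustrated with a short sketch that
expands the three clause templates over a toy vocabulary. The constants below
are made up, the clauses are written in an ASCII rendering (`!`, `^`, `=>`),
and the `expand` helper is hypothetical; it mimics the described behaviour and
is not Alchemy's own expansion code.

```python
from itertools import product

# Hypothetical toy constants; the real MLN uses the WebKB words and classes.
words = ["course", "homework", "research"]
classes = ["Student", "Faculty"]

# Each template is paired with its '+' variables.
templates = [
    ("Has(+word, page) => PageClass(+class, page)", ("word", "class")),
    ("!Has(+word, page) => PageClass(+class, page)", ("word", "class")),
    ("PageClass(+c1, p1) ^ Linked(p1, p2) => PageClass(+c2, p2)", ("c1", "c2")),
]
domains = {"word": words, "class": classes, "c1": classes, "c2": classes}


def expand(template, plus_vars):
    """Create one clause per combination of constants for the '+' variables."""
    clauses = []
    for constants in product(*(domains[v] for v in plus_vars)):
        clause = template
        for var, const in zip(plus_vars, constants):
            clause = clause.replace("+" + var, const)
        clauses.append(clause)
    return clauses


expanded = [c for template, vs in templates for c in expand(template, vs)]
print(len(expanded))
```

With |W| words and |K| classes this scheme yields 2|W||K| + |K|^2 clauses,
each of which receives its own weight.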

For CiteSeer (Lawrence, Giles, & Bollacker, 1999), we used the version
created by Poon and Domingos (2007). The dataset contains 1,563 bibliographic
citations such as:

J. Jaffar, J.-L. Lassez. Constraint logic programming. In Proceedings
of the Fourteenth ACM symposium of the principles of programming languages,
pages 111-119, Munich, 1987.

The task is to segment each of these citations into three fields: Author,
Title and Venue. The dataset has four independent subsets consisting
of citations to disjoint publications in four different research areas. We
used the MLN for the isolated segmentation model in (Poon & Domingos, 2007).
After grounding, this model results in a large network with more than 30,000
query atoms and 110,000 ground clauses.

All the datasets and MLNs, except for the Alzheimer’s datasets, can be found
at the Alchemy website.[1]

4.4.2  Methodology 

For the max-margin weight learner, we used a simple process for
selecting the value of the C parameter. For each train/test split, we trained
the algorithm with five different values of C: 1, 10, 100, 1000, and 10000,
then selected the one which gave the highest average $F_1$ score on training.
The ε parameter was set to 0.001. To solve the QP problems in Algorithm 2.1
and LP problems in the LP-relaxation MPE inference, we used the Mosek[2]
solver. The PSCG algorithm was carefully tuned by its author. For MC-SAT,
we used the default setting, 100 burn-in and 1000 sampling iterations, and
predicted that an atom is true iff its probability is at least 0.5.
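The thresholding rule used with MC-SAT predicts each atom from its own
marginal probability, which need not coincide with the single most probable
joint assignment (the MPE solution). The toy sketch below, using a made-up
joint distribution over two query atoms, shows a case where the two
predictions differ; the distribution and variable names are assumptions for
illustration only.

```python
# Hypothetical joint distribution over two query atoms (y0, y1).
joint = {(0, 0): 0.4, (1, 0): 0.3, (1, 1): 0.3, (0, 1): 0.0}

# MPE prediction: the single most probable joint assignment.
mpe = max(joint, key=joint.get)

# Marginal prediction: an atom is true iff its marginal probability >= 0.5.
n_atoms = 2
marginals = [
    sum(p for y, p in joint.items() if y[i] == 1) for i in range(n_atoms)
]
marginal_prediction = tuple(1 if m >= 0.5 else 0 for m in marginals)

print(mpe, marginal_prediction)
```

Here thresholding the marginals selects an assignment that is less probable
than the MPE assignment, because each atom is decided in isolation.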

For the Alzheimer’s datasets, we used the same experimental setup
described in section 3.3.2. On the WebKB and CiteSeer datasets, we ran
four-fold cross-validation (i.e. leave one university/topic out).

We used $F_1$ to measure the performance of each algorithm on the WebKB
and CiteSeer datasets.
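For reference, the $F_1$ measure is the harmonic mean of precision and recall.
The sketch below computes it over binary labels with 1 as the positive class.
How the scores are aggregated over atoms and categories in the experiments is
not specified here, so the function `f1_score` and its inputs should be read
as a generic illustration rather than the exact evaluation script.

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR / (P + R) over binary labels, with positive class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical truth values and predictions for five query atoms.
example_f1 = f1_score([1, 1, 0, 0, 1], [1, 0, 1, 0, 1])
print(round(example_f1, 3))
```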

[1] http://alchemy.cs.washington.edu
[2] http://www.mosek.com/
Table 4.1: $F_1$ scores on WebKB

System                   Cornell   Texas   Washington   Wisconsin   Average
PSCG-MCSAT               0.418     0.298   0.577        0.568       0.465
PSCG-LPRelax             0.420     0.310   0.588        0.575       0.474
MM-Δ0/1-MaxWalkSAT       0.150     0.162   0.122        0.122       0.139
MM-Δ0/1-LPRelax          0.282     0.372   0.675        0.521       0.462
MM-ΔHamming-LPRelax      0.580     0.451   0.715        0.659       0.601

Table 4.2: $F_1$ scores of different inference algorithms on WebKB

System                    Cornell   Texas   Washington   Wisconsin   Average
PSCG-MCSAT                0.418     0.298   0.577        0.568       0.465
PSCG-MaxWalkSAT           0.161     0.140   0.119        0.129       0.137
PSCG-LPRelax              0.420     0.310   0.588        0.575       0.474
MM-ΔHamming-MCSAT         0.470     0.370   0.573        0.481       0.473
MM-ΔHamming-MaxWalkSAT    0.185     0.184   0.150        0.154       0.168
MM-ΔHamming-LPRelax       0.580     0.451   0.715        0.659       0.601

4.4.3  Results  and  Discussion 

Table  4.1  and  4.3  present  the  performance  of  different  systems  on  the 
WebKB  and  Citeseer  datasets.  Each  system  is  named  by  the  weight  learner 
used,  the  loss  function  used  in  training,  and  the  inference  algorithm  used  in 
testing.  For  max-margin  (MM)  learner  with  margin  rescaling,  the  inference 
used  in  training  is  the  loss-augmented  version  of  the  one  used  in  testing.  For 
example,  M M-  A  Hamming-^ P Rel ax  is  the  max-margin  weight  learner  using  the 
loss-augmented  (Hamming  loss)  LP-relaxation  MPE  inference  algorithm  in 
training  and  the  LP-relaxation  MPE  inference  algorithm  in  testing. 

Table 4.1 shows that the model trained using MaxWalkSAT has very
low predictive accuracy. This result is consistent with the result presented
in (Riedel, 2008), which also found that the MPE solution found by MaxWalkSAT
is not very accurate. Using the proposed LP-relaxation MPE inference improves
the $F_1$ score from 0.139 to 0.462 (the MM-Δ0/1-LPRelax system). The best
system is then obtained by rescaling the margin and training with our
loss-augmented LP-relaxation MPE inference, which is the only difference
between MM-ΔHamming-LPRelax and MM-Δ0/1-LPRelax. MM-ΔHamming-LPRelax
achieves the best $F_1$ score (0.601), which is much higher than the
0.465 $F_1$ score obtained by the previously best discriminative weight
learner for MLNs, PSCG-MCSAT.

Table 4.3: $F_1$ scores on CiteSeer

System                 Constraint   Face    Reasoning   Reinforcement   Average
PSCG-MCSAT             0.937        0.914   0.931       0.975           0.939
MM-ΔHamming-LPRelax    0.933        0.922   0.924       0.958           0.934

Table 4.4: $F_1$ scores on CiteSeer with different parameter values

System                       Constraint   Face    Reasoning   Reinforcement   Average
PSCG-MCSAT-5                 0.852        0.844   0.836       0.923           0.864
PSCG-MCSAT-10                0.937        0.914   0.931       0.973           0.939
PSCG-MCSAT-15                0.878        0.896   0.780       0.891           0.861
PSCG-MCSAT-20                0.850        0.859   0.710       0.784           0.801
PSCG-MCSAT-100               0.658        0.697   0.600       0.668           0.656
MM-ΔHamming-LPRelax-1        0.933        0.922   0.924       0.955           0.934
MM-ΔHamming-LPRelax-10       0.926        0.922   0.925       0.955           0.932
MM-ΔHamming-LPRelax-100      0.926        0.922   0.925       0.954           0.932
MM-ΔHamming-LPRelax-1000     0.931        0.918   0.925       0.958           0.933
MM-ΔHamming-LPRelax-10000    0.932        0.922   0.919       0.968           0.935

Table 4.2 compares the performance of the proposed LP-relaxation
MPE inference algorithm against MCSAT and MaxWalkSAT on the best
trained models by PSCG and MM on the WebKB dataset. In both cases,
the LP-relaxation MPE inference achieves much better $F_1$ scores than those
of MCSAT and MaxWalkSAT. This demonstrates that the approximate MPE
solution found by the LP-relaxation algorithm is much more accurate than
the one found by the MaxWalkSAT algorithm. The fact that the performance
of the LP-relaxation is higher than that of MCSAT shows that in collective
classification it is better to use the MPE solution as the prediction than
the marginal prediction.

Table 4.5: Average predictive accuracies and standard deviations on
Alzheimer’s datasets with transitive clause added

Data set            Aleph        Aleph++      Aleph++      Aleph        Aleph++      Aleph++
                    ExactL2      ExactL2      ExactL1      MM-LPRelax   MM-LPRelax   MM-L1-LPRelax
Alzheimer amine     87.0 ± 3.3   91.7 ± 3.5   90.5 ± 3.6   87.0 ± 2.2   89.2 ± 2.9   88.8 ± 3.0
Alzheimer toxic     88.8 ± 4.8   91.4 ± 3.6   91.9 ± 4.1   88.5 ± 4.2   90.8 ± 3.6   91.6 ± 4.3
Alzheimer acetyl    84.1 ± 3.1   88.7 ± 2.1   87.6 ± 2.7   86.3 ± 2.8   88.3 ± 2.9   87.9 ± 2.8
Alzheimer memory    76.5 ± 3.5   81.3 ± 4.8   81.3 ± 4.1   79.1 ± 3.0   81.5 ± 4.2   80.7 ± 4.0

Table 4.6: Average number of clauses learned on Alzheimer’s datasets

Data set            Aleph   Aleph++   Aleph++   Aleph++   Aleph++      Aleph++
                                      ExactL2   ExactL1   MM-LPRelax   MM-L1-LPRelax
Alzheimer amine     10      7061      5070      3477      6981         35
Alzheimer toxic     9       2034      1194      747       2034         25
Alzheimer acetyl    12      8662      5427      2433      8621         51
Alzheimer memory    11      6524      4250      2471      6297         31

For the WebKB dataset, there are other results reported in previous
work, such as those in (Taskar et al., 2004), but those results cannot be
directly compared to our results since we use a different version of the
dataset and test on a more complicated task (a page can have multiple labels,
not just one).

On CiteSeer (Table 4.3), the performance of the max-margin method is very
close to that of PSCG. However, the performance of the max-margin method
is much more stable. Table 4.4 shows the performance of MM weight learners
and PSCG with different parameter values, obtained by varying the C value for
MM and the number of iterations for PSCG. The best number of iterations for
PSCG is 9 or 10. In principle, we should run PSCG until it converges to get
the optimal weight vector. However, in this case, the performance of PSCG
drops drastically on both training and testing after a certain number of
iterations. For example, from Table 4.4 we can see that at 10 iterations PSCG
achieves the best $F_1$ score of 0.939, but after 15 iterations, its $F_1$
score drops to 0.861, which is much worse than those of the max-margin weight
learners. Moreover, if we let it run until 100 iterations, then its $F_1$
score is only 0.656. On the other hand, the performance of MM varies only
slightly with different values of C, and we do not need to tune the number of
iterations of MM. On this dataset, Poon and Domingos (2007) achieved an $F_1$
score of 0.944 with the same MLN by using a version of the voted perceptron
algorithm called Contrastive Divergence (CD) (Hinton, 2002) to learn the
weights. However, the performance of the CD algorithm is very sensitive to
the learning rate (Lowd & Domingos, 2007), which requires a very careful
tuning process to learn a good model.

Tables 4.5 and 4.6 compare the performance of the MM weight learners
against some of the systems described in section 3.2 for the case when the
transitive clause is included. For the MM weight learner, instead of adding
the transitive clause to the learnt MLNs in testing, we learned the weights
in the presence of the transitive clause since it can handle recursive
clauses. In terms of accuracy, the MM weight learner is a little bit worse
than the ones proposed in the previous chapter. However, the 1-norm MM weight
learner (MM-L1-LPRelax) produces a very compact model, with less than 50
clauses, with high accuracy while the models learnt by other systems have
thousands of clauses.

Regarding  training  time,  the  max-margin  weight  learner  is  comparable 
to other learners. On the Alzheimer’s datasets, it took fewer than 100 iterations
to find the optimal weights, which resulted in a few minutes of training. For
the WebKB and CiteSeer datasets, the numbers of training iterations are about
200 and 50 respectively, which correspond to a few hours of training for WebKB and
less than an hour for CiteSeer.

4.5  Related  Work 

The work in this chapter is related to various previous projects. Among
them, M3N (Taskar et al., 2004) is probably the most related. It is a special
case of structural SVMs where the feature function $\phi(x, y)$ is represented by a
Markov network. When the Markov network can be triangulated and the loss
function can be linearly decomposed, the original exponentially-sized QP can
be reformulated as a polynomially-sized QP (Taskar et al., 2004). Then, the
polynomially-sized QP can be solved by general QP solvers (Anguelov, Taskar,
Chatalbashev, Koller, Gupta, Heitz, & Ng, 2005), decomposition methods
(Taskar et al., 2004), extragradient methods (Taskar, Lacoste-Julien, & Jordan,
2006), or exponentiated gradient methods (Collins, Globerson, Koo, Carreras,
& Bartlett, 2008). As mentioned by Taskar et al. (2004), these methods
can also be used when the graph cannot be triangulated, but the algorithms
only yield approximate solutions, like our approach. However, these algorithms
are restricted to the cases where a polynomially-sized reformulation exists
(Joachims et al., 2009). Consequently, in this work we used the general cutting
plane algorithm, which imposes no restrictions on the representation. The
ground MLN can be any kind of graph. On the other hand, since an MLN is a
template for constructing Markov networks (Richardson & Domingos, 2006),
the proposed model, M3LN, can also be seen as a template for constructing
M3Ns. Hence, when the ground MLN can be triangulated and the loss
is a linearly decomposable function, the algorithms developed for M3Ns can
be applied. Our work is also closely related to Relational Markov Networks
(RMNs) (Taskar et al., 2002). However, by using MLNs, M3LNs are
more powerful than RMNs in terms of representation (Richardson & Domingos,
2006). Besides, the objectives of M3LNs and RMNs are different. The former
maximizes the margin between the true assignment and other competing
assignments, while the latter maximizes the conditional likelihood of the true
assignment. Another related system is RUMBLE (Rückert & Kramer, 2007),
a margin-based approach to first-order rule learning. In that work, the goal
is to find a set of weighted rules that maximizes a quantity called margin minus
variance. However, unlike M3LNs, RUMBLE only applies to independent
binary classification problems and is unable to perform structured prediction
or collective classification. In terms of applying the general structural SVM
framework to a specific representation, our work is related to the work of
Szummer, Kohli, and Hoiem (2008), which used CRFs as the representation and
graph cuts as the inference algorithm. In the context of discriminative learning,
our work is related to previous work on discriminative training for MLNs
(Singla & Domingos, 2005; Lowd & Domingos, 2007; Biba et al., 2008). We
have mentioned some of them (Singla & Domingos, 2005; Lowd & Domingos,
2007) in previous sections. The main difference between the work of Biba
et al. (2008) and ours is that we assume the structure is given and apply the
max-margin framework to learn the weights, while Biba et al. (2008) try to learn
a structure that maximizes the conditional likelihood of the data.

4.6  Chapter  Summary 

We have presented a max-margin weight learning method for MLNs
based on the framework of structural SVMs. It resulted in a new model, M3LN,
that has the representational expressiveness of MLNs and the predictive
performance of SVMs. M3LNs can be trained to optimize different performance
measures depending on the needs of the application. To train the proposed
model, we developed a new approximation algorithm for loss-augmented MPE
inference in MLNs based on LP-relaxation. The experimental results showed
that the new max-margin learner generally has better or equally good but
more stable predictive accuracy (as measured by F1) than the previously best
discriminative MLN weight learner.

Chapter 5


Online Max-Margin Weight Learning for MLNs


5.1  Introduction 

In the previous chapter, we presented a max-margin algorithm for learning
weights for MLNs. However, like other existing weight learning algorithms
for MLNs, the algorithm uses batch training, which becomes computationally
expensive and even infeasible for very large datasets since the training examples
may not fit in main memory. To address this issue, in this chapter, we
derive a new online max-margin algorithm for structured prediction from the
primal-dual framework for strongly convex loss functions (section 2.5).

The  remainder  of  the  chapter  is  organized  as  follows.  Section  5.2 
presents the new online max-margin algorithm for structured prediction.
Section 5.3 reports the experimental evaluation. Section 5.4 discusses related work,
and section 5.5 summarizes the chapter.

5.2 Online Coordinate-Dual-Ascent Algorithms for Max-Margin Structured Prediction

In this section, we derive new online algorithms for structured prediction
based on the algorithmic framework described in section 2.5 using the
CDA update rule. A standard complexity function used in structured prediction
is $f(w) = \frac{1}{2}\|w\|_2^2$. Regarding the loss function $g_t$, a generalized version
of the Hinge loss is widely used in max-margin structured prediction (Taskar
et al., 2004; Tsochantaridis et al., 2004):

\[
l_{MM}(w, (x_t, y_t)) = \max_{y \in Y} \left[ \Delta(y_t, y) - \langle w, \phi(x_t, y_t) - \phi(x_t, y) \rangle \right]_+
\]

However, minimizing the above loss results in an optimization problem with
many constraints in the primal (one constraint for each possible label $y \in Y$),
which is usually expensive to solve. To overcome this problem, we consider
two simpler variants of the max-margin loss, each of which involves only a particular
label: the maximal loss function and the prediction-based loss function.

Maximal loss (ML) function: This loss function is based on the
maximal loss label at step $t$, $y_t^{ML} = \arg\max_{y \in Y} \{\Delta(y_t, y) + \langle w_t, \phi(x_t, y) \rangle\}$:

\[
l_{ML}(w, (x_t, y_t)) = \left[ \Delta(y_t, y_t^{ML}) - \langle w, \phi(x_t, y_t) - \phi(x_t, y_t^{ML}) \rangle \right]_+
\]

The loss $l_{ML}(w_t, (x_t, y_t))$ is the greatest loss the algorithm would suffer at
step $t$ if it used the maximal loss label $y_t^{ML}$ as the prediction. On the
other hand, it checks whether the max-margin constraints are satisfied, since
if $l_{ML}(w_t, (x_t, y_t)) = 0$ then $y_t^{ML} = y_t$, which means that the current weight
vector $w_t$ scores the correct label $y_t$ higher than any other label $y'$, where the
difference is at least $\Delta(y_t, y')$. Note that the maximal loss label $y_t^{ML}$ is an
input to the maximal loss (this is possible in online learning since the loss is
computed after the weight vector $w_t$ is chosen); therefore it does not depend
on the weight vector $w$ for which we want to compute the loss. So the maximal
loss function only concerns the particular constraint for whether the true label
$y_t$ is scored higher than the maximal loss label with a margin of $\Delta(y_t, y_t^{ML})$.
This is the key difference between the maximal loss and the max-margin loss,
since the latter looks at the constraints of all possible labels. The main drawback
of the maximal loss is that finding the maximal loss label $y_t^{ML}$, which is
also called the loss-augmented inference problem (section 2.4), is only feasible
for some decomposable label loss functions such as Hamming loss, since the
maximal loss label depends on the label loss function $\Delta(y_t, y')$. This is the
reason why we want to consider the second loss function, prediction-based loss,
which can be used with any label loss function such as the $(1 - F_1)$ loss.

Prediction-based loss (PL) function: This loss function is based on
the predicted label $y_t^P = h_{w_t}(x_t) = \arg\max_{y \in Y} \langle w_t, \phi(x_t, y) \rangle$:

\[
l_{PL}(w, (x_t, y_t)) = \left[ \Delta(y_t, y_t^P) - \langle w, \phi(x_t, y_t) - \phi(x_t, y_t^P) \rangle \right]_+
\]

Like the maximal loss, the prediction-based loss only concerns the constraint
for the predicted label $y_t^P$. We have $l_{PL}(w_t, (x_t, y_t)) \le l_{ML}(w_t, (x_t, y_t))$ since
$y_t^{ML}$ is the maximal loss label for $w_t$. As a result, the update based on the
prediction-based loss function is less aggressive than the one based on the
maximal loss function. However, the prediction-based loss function can be
used with any label loss function since the predicted label $y_t^P$ does not depend
on the label loss function.
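
The relationship between the three losses can be checked on a tiny label space. The following is a minimal sketch, not the thesis implementation: the feature map `phi`, the toy input `x`, and the weight vector `w_t` are hypothetical choices made only for illustration, and the label space is small enough to enumerate exhaustively.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x, y):
    # Hypothetical joint feature map (an assumption for this sketch):
    # (sum of x_i over positions labeled 1, number of positions labeled 1).
    return (sum(xi for xi, yi in zip(x, y) if yi == 1), float(sum(y)))

def hamming(y_true, y):
    # Label loss Delta(y_true, y): number of positions where the labels differ.
    return sum(a != b for a, b in zip(y_true, y))

def margin_violation(w, x, y_true, y_other):
    # [Delta(y_t, y) - <w, phi(x_t, y_t) - phi(x_t, y)>]_+ for one competing label y.
    dphi = [a - b for a, b in zip(phi(x, y_true), phi(x, y_other))]
    return max(0.0, hamming(y_true, y_other) - dot(w, dphi))

def max_margin_loss(w, x, y_true, labels):
    # l_MM: worst violation over every label in the label space.
    return max(margin_violation(w, x, y_true, y) for y in labels)

def maximal_loss_label(w_t, x, y_true, labels):
    # Loss-augmented inference by brute force: argmax of Delta + score.
    return max(labels, key=lambda y: hamming(y_true, y) + dot(w_t, phi(x, y)))

def predicted_label(w_t, x, labels):
    # Plain inference by brute force: argmax of the score.
    return max(labels, key=lambda y: dot(w_t, phi(x, y)))

x = (1.0, -2.0, 0.5)
y_true = (1, 0, 1)
w_t = (0.5, -0.2)
labels = list(product((0, 1), repeat=len(x)))

l_mm = max_margin_loss(w_t, x, y_true, labels)
l_ml = margin_violation(w_t, x, y_true, maximal_loss_label(w_t, x, y_true, labels))
l_pl = margin_violation(w_t, x, y_true, predicted_label(w_t, x, labels))
print("l_PL =", l_pl, " l_ML =", l_ml, " l_MM =", l_mm)
```

In this toy case the predicted label is already correct, so the prediction-based loss is zero, while the maximal loss is positive because the margin requirement is not yet met. This is the sense in which the prediction-based update is less aggressive.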


To apply the primal-dual algorithmic framework described in section
2.5, we need to find the Fenchel conjugate function of the complexity function
$f(w)$ and the loss function $g(w)$. The Fenchel conjugate function of the
complexity function $f(w) = \frac{1}{2}\|w\|_2^2$ is itself, i.e. $f^*(\theta) = \frac{1}{2}\|\theta\|_2^2$
(Boyd & Vandenberghe, 2004). For the loss function, recall that the Fenchel conjugate
function of the Hinge-loss $g(w) = [\gamma - \langle w, x \rangle]_+$ is:

\[
g^*(\theta) =
\begin{cases}
-\gamma\alpha & \text{if } \theta \in \{-\alpha x : \alpha \in [0, 1]\} \\
\infty & \text{otherwise}
\end{cases}
\]

(Appendix A in (Shalev-Shwartz & Singer, 2007a)). We can see that both the
prediction-based loss and the maximal loss have the same form as the Hinge-loss,
where $\gamma$ is replaced by the label loss function $\Delta(y_t, y_t^P)$ and $\Delta(y_t, y_t^{ML})$,
and $x$ is replaced by $\Delta\phi_t^{PL} = \phi(x_t, y_t) - \phi(x_t, y_t^P)$ and
$\Delta\phi_t^{ML} = \phi(x_t, y_t) - \phi(x_t, y_t^{ML})$ for the prediction-based loss and the maximal loss respectively.
Using the result of the Hinge-loss, we have the Fenchel conjugate function of
the prediction-based loss and the maximal loss as follows:

\[
g_t^*(\theta) =
\begin{cases}
-\Delta(y_t, y_t^{P|ML})\,\alpha & \text{if } \theta \in \{-\alpha\,\Delta\phi_t^{PL|ML} : \alpha \in [0, 1]\} \\
\infty & \text{otherwise}
\end{cases}
\]

The next step is to derive the closed-form solution of the CDA update rule.
The optimization problem that we need to solve is:


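\[
\arg\max_{\lambda_t} \; -(\sigma t)\, f^*\!\left( -\frac{\lambda_{1:(t-1)} + \lambda_t}{\sigma t} \right) - g_t^*(\lambda_t) \tag{5.1}
\]

where $\lambda_{1:(t-1)} = \sum_{i=1}^{t-1} \lambda_i$. Substituting the conjugate functions $f^*$ and $g_t^*$ as
above in equation 5.1, we obtain the following optimization problem:

\[
\begin{aligned}
& \arg\max_{\alpha \in [0,1]} \; -\frac{\sigma t}{2} \left\| -\frac{\lambda_{1:(t-1)} - \alpha\,\Delta\phi_t^{PL|ML}}{\sigma t} \right\|_2^2 + \alpha\,\Delta(y_t, y_t^{P|ML}) \\
={} & \arg\max_{\alpha \in [0,1]} \; -\frac{\alpha^2 \|\Delta\phi_t^{PL|ML}\|_2^2}{2 \sigma t}
 + \alpha \left( \Delta(y_t, y_t^{P|ML}) + \frac{1}{\sigma t} \langle \lambda_{1:(t-1)}, \Delta\phi_t^{PL|ML} \rangle \right)
 - \frac{\|\lambda_{1:(t-1)}\|_2^2}{2 \sigma t}
\end{aligned}
\]

This objective function is a function of $\alpha$ only, and in fact it is a concave
parabola whose maximum is attained at the point:

\[
\alpha^* = \frac{(\sigma t)\,\Delta(y_t, y_t^{P|ML}) + \langle \lambda_{1:(t-1)}, \Delta\phi_t^{PL|ML} \rangle}{\|\Delta\phi_t^{PL|ML}\|_2^2}
\]

If $\alpha^* \in [0, 1]$, then $\alpha^*$ is the maximizer of the problem. If $\alpha^* < 0$, then 0 is the
maximizer, and if $\alpha^* > 1$ then 1 is the maximizer. In summary, the solution of
the above optimization is:

\[
\alpha = \max\left\{ 0, \min\left\{ 1, \frac{(\sigma t)\,\Delta(y_t, y_t^{P|ML}) + \langle \lambda_{1:(t-1)}, \Delta\phi_t^{PL|ML} \rangle}{\|\Delta\phi_t^{PL|ML}\|_2^2} \right\} \right\}
\]

To obtain the update in terms of the weight vectors $w$, we have: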


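\[
\begin{aligned}
w_{t+1} &= \nabla f^*\!\left( -\frac{1}{\sigma t} \lambda_{1:t} \right)
 = -\frac{1}{\sigma t} \left( \lambda_{1:(t-1)} - \alpha\,\Delta\phi_t^{PL|ML} \right) \\
&= -\frac{1}{\sigma t} \left( -(\sigma(t-1))\, w_t \right)
 + \frac{1}{\sigma t} \max\left\{ 0, \min\left\{ 1, \frac{(\sigma t)\,\Delta(y_t, y_t^{P|ML}) + \langle -(\sigma(t-1))\, w_t, \Delta\phi_t^{PL|ML} \rangle}{\|\Delta\phi_t^{PL|ML}\|_2^2} \right\} \right\} \Delta\phi_t^{PL|ML} \\
&= \frac{t-1}{t}\, w_t + \min\left\{ \frac{1}{\sigma t}, \frac{\left[ \Delta(y_t, y_t^{P|ML}) - \frac{t-1}{t} \langle w_t, \Delta\phi_t^{PL|ML} \rangle \right]_+}{\|\Delta\phi_t^{PL|ML}\|_2^2} \right\} \Delta\phi_t^{PL|ML}
\end{aligned}
\]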
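
The closed-form solution can be sanity-checked numerically. The sketch below is an illustration under stated assumptions rather than code from the thesis: the function names and the test vectors are hypothetical, the dual objective is the one obtained after substituting the two conjugates, and the relation $\lambda_{1:(t-1)} = -(\sigma(t-1)) w_t$ from the derivation is used to move between the dual and primal forms.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dual_objective(alpha, lam_prev, dphi, delta, sigma, t):
    # -(1 / (2 sigma t)) * ||lam_prev - alpha * dphi||^2 + alpha * delta
    v = [l - alpha * d for l, d in zip(lam_prev, dphi)]
    return -dot(v, v) / (2.0 * sigma * t) + alpha * delta

def cda_alpha(lam_prev, dphi, delta, sigma, t):
    # Unconstrained maximizer of the concave parabola, clipped to [0, 1].
    # Assumes dphi is not the zero vector.
    alpha_star = (sigma * t * delta + dot(lam_prev, dphi)) / dot(dphi, dphi)
    return max(0.0, min(1.0, alpha_star))

def cda_weight_update(w_t, dphi, delta, sigma, t):
    # Primal form: w_{t+1} = ((t-1)/t) w_t + min{1/(sigma t), loss/||dphi||^2} dphi
    shrink = (t - 1) / t
    loss = max(0.0, delta - shrink * dot(w_t, dphi))
    step = min(1.0 / (sigma * t), loss / dot(dphi, dphi))
    return [shrink * wi + step * di for wi, di in zip(w_t, dphi)]

# Hypothetical numbers chosen only for illustration.
sigma, t, delta = 0.5, 3, 2.0
w_t = [0.4, -0.1]
dphi = [1.0, -2.0]
lam_prev = [-sigma * (t - 1) * wi for wi in w_t]  # lambda_{1:(t-1)} = -(sigma (t-1)) w_t

alpha = cda_alpha(lam_prev, dphi, delta, sigma, t)
w_dual = [-(l - alpha * d) / (sigma * t) for l, d in zip(lam_prev, dphi)]
w_primal = cda_weight_update(w_t, dphi, delta, sigma, t)
print(alpha, w_dual, w_primal)
```

Evaluating `dual_objective` on a grid of values in [0, 1] gives a brute-force check that the clipped value is the maximizer, and `w_dual` and `w_primal` agree up to rounding.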


The new method is summarized in Algorithm 5.1. Interestingly, this update
formula has the same form as that of the subgradient algorithm (Nathan Ratliff
& Zinkevich, 2007), which is derived from the simple update criterion:

\[
w_{t+1} = w_t - \frac{1}{\sigma t} \left( \sigma w_t - \Delta\phi_t^{ML} \right)
 = \frac{t-1}{t}\, w_t + \frac{1}{\sigma t}\, \Delta\phi_t^{ML}
\]

The key difference is in the learning rate. The learning rate of the subgradient
algorithm, which is equal to $1/(\sigma t)$, does not depend on the loss suffered at
each step, while the learning rate of CDA is the minimum of $1/(\sigma t)$ and
the loss suffered at each step (normalized by $\|\Delta\phi_t\|_2^2$). In the beginning, when $t$ is small and therefore
$1/(\sigma t)$ is large (assuming $\sigma$ is small), CDA’s learning rate is controlled by the
loss suffered at each step. In contrast, when $t$ is large and therefore $1/(\sigma t)$
is small, the learning rate of CDA is driven by the quantity $1/(\sigma t)$. In
other words, at the beginning, when the model is not good, CDA aggressively
updates the model based on the loss suffered at each step; and later, when the
model is good, it updates the model less aggressively.

Algorithm 5.1 Online Coordinate-Dual-Ascent Algorithms for Structured Prediction

Parameters: A constant $\sigma > 0$; label loss function $\Delta(y, y')$
Initialize: $w_1 = 0$
for $t = 1$ to $T$ do
    Receive an instance $x_t$
    Predict $y_t^P = \arg\max_{y \in Y} \langle w_t, \phi(x_t, y) \rangle$
    Receive the correct target $y_t$
    (For maximal loss) Compute $y_t^{ML} = \arg\max_{y \in Y} \{\Delta(y_t, y) + \langle w_t, \phi(x_t, y) \rangle\}$
    Compute $\Delta\phi_t$:
        PL: $\Delta\phi_t = \phi(x_t, y_t) - \phi(x_t, y_t^P)$
        ML: $\Delta\phi_t = \phi(x_t, y_t) - \phi(x_t, y_t^{ML})$
    Compute loss:
        PL (CDA): $l_t = [\Delta(y_t, y_t^P) - \frac{t-1}{t} \langle w_t, \Delta\phi_t \rangle]_+$
        ML (CDA): $l_t = [\Delta(y_t, y_t^{ML}) - \frac{t-1}{t} \langle w_t, \Delta\phi_t \rangle]_+$
    Update:
        CDA: $w_{t+1} = \frac{t-1}{t} w_t + \min\{1/(\sigma t),\; l_t / \|\Delta\phi_t\|_2^2\}\, \Delta\phi_t$
end for
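
As an illustration of Algorithm 5.1, the following minimal sketch runs the CDA update on a toy multiclass problem treated as structured prediction. Everything outside the update rule itself is an assumption made for the example: the block feature map `phi`, the 0/1 label loss, the four toy examples, and the choice to apply only the $(t-1)/t$ shrinkage when $\Delta\phi_t$ is the zero vector. Inference is done by enumerating labels, which is only possible because the label set is tiny.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi(x, y, num_labels):
    # Hypothetical block feature map: copy x into the block that belongs to label y.
    out = [0.0] * (len(x) * num_labels)
    out[y * len(x):(y + 1) * len(x)] = x
    return out

def zero_one(y_true, y):
    # Label loss Delta(y_true, y) for this toy problem.
    return 0.0 if y_true == y else 1.0

def cda_train(examples, num_labels, sigma=0.1, maximal_loss=True):
    """One pass of the CDA update over (x, y) pairs; returns the final weight vector."""
    w = [0.0] * (len(examples[0][0]) * num_labels)
    for t, (x, y_true) in enumerate(examples, start=1):
        feats = [phi(x, y, num_labels) for y in range(num_labels)]
        scores = [dot(w, f) for f in feats]
        if maximal_loss:
            # Maximal loss label: loss-augmented inference.
            y_hat = max(range(num_labels), key=lambda y: zero_one(y_true, y) + scores[y])
        else:
            # Predicted label: plain inference.
            y_hat = max(range(num_labels), key=lambda y: scores[y])
        dphi = [a - b for a, b in zip(feats[y_true], feats[y_hat])]
        norm2 = dot(dphi, dphi)
        shrink = (t - 1) / t
        loss = max(0.0, zero_one(y_true, y_hat) - shrink * dot(w, dphi))
        # Assumption: when dphi is zero, the step is taken to be zero.
        step = min(1.0 / (sigma * t), loss / norm2) if norm2 > 0 else 0.0
        w = [shrink * wi + step * di for wi, di in zip(w, dphi)]
    return w

def predict(w, x, num_labels):
    return max(range(num_labels), key=lambda y: dot(w, phi(x, y, num_labels)))

toy = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 0), ([0.0, 1.0], 1)]
w = cda_train(toy, num_labels=2, sigma=0.1, maximal_loss=True)
print(w, [predict(w, x, 2) for x, _ in toy])
```

For MLNs, the two `max` calls would be replaced by MPE inference and loss-augmented MPE inference over the ground network; the update itself is unchanged.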

We can use the derived CDA algorithm to perform online weight learning
for MLNs, since the weight learning problem in MLNs can be cast as a
max-margin structured prediction problem as described in section 4.2.

5.3 Experimental Evaluation

In this section, we conduct experiments to answer the following questions
in the context of MLNs:

1.  How  does  our  new  online  learning  algorithm,  CDA,  compare  to  existing 
online  max-margin  learning  methods?  In  particular,  is  it  better  than  the 
subgradient  method  due  to  its  more  aggressive  update  in  the  dual? 

2.  How  does  it  compare  to  the  batch  max-margin  weight  learning  method 
developed  in  the  previous  chapter? 

3.  How  well  does  using  the  prediction-based  loss  compare  to  the  maximal 
loss  in  practice? 

5.3.1  Datasets 

We ran experiments on three large, real-world datasets with thousands
of examples: the CiteSeer dataset for bibliographic citation segmentation
described in section 4.4.1, a web search query dataset (Mihalkova & Mooney, 2009)
obtained from Microsoft Research for query disambiguation, and the CoNLL
2005 dataset (Carreras & Marquez, 2005) for Semantic Role Labeling. We
did not run experiments on the Alzheimer’s datasets and the WebKB dataset since
those datasets contain only a few mega-examples (when taking into account the
transitive relationship, each Alzheimer’s dataset becomes a mega-example).

For the search query disambiguation, we used the data created by
Mihalkova and Mooney (2009). The dataset consists of thousands of search
sessions where ambiguous queries are asked. The data are split into 3 disjoint
sets: training, validation, and test. There are 4,618 search sessions in the
training set, 4,803 sessions in the validation set, and 11,234 sessions in the
test set. In each session, the set of possible search results for a given ambiguous
query is given, and the goal is to rank these results based on how likely
they are to be clicked by the user. A user may click on more than one result for a
given query. To solve this problem, Mihalkova and Mooney (2009) proposed
three different MLNs which correspond to different levels of information used
in disambiguating the query. We used all three MLNs in our experiments. In
comparison to the CiteSeer dataset, the search query dataset is larger but is
much noisier, since a user can click on a result because it is relevant or because
the user is just doing an exploratory search.

The CoNLL 2005 dataset contains over 40,000 sentences from the Wall
Street Journal (WSJ). Given a sentence, the task is to analyze the propositions
expressed by some target verbs of the sentence. In particular, for each target
verb, all of its semantic components must be identified and labeled with their
semantic roles, as in the following sentence for the verb accept.

[A0 He] [AM-MOD would] [AM-NEG n’t] [V accept] [A1 anything of value]
from [A2 those he was writing about].

A verb and its set of semantic roles form a proposition in the sentence, and a
sentence usually contains more than one proposition. Each proposition serves
as a training example. The dataset consists of three disjoint subsets: training,
development, and test. The numbers of propositions (or examples) in the training,
development, and test sets are 90,750; 3,248; and 5,267 respectively.¹
We used the MLN constructed by Riedel (2008), which contains clauses that
capture the features of constituents and dependencies between semantic components
of the same verb.

5.3.2  Methodology 

To  answer  the  above  questions,  we  ran  experiments  with  the  following 
systems: 


MM: The offline max-margin weight learner for MLNs presented in the previous
chapter.


1-best MIRA: MIRA is one of the first online learning algorithms for structured
prediction, proposed by McDonald, Crammer, and Pereira (2005).
A simple version of MIRA, called 1-best MIRA, is widely used in practice
since its update rule has a closed-form solution. 1-best MIRA has been
used in previous work (Riedel & Meza-Ruiz, 2008) to learn weights for
MLNs. In each round, it updates the weight vector $w$ as follows:

\[
w_{t+1} = w_t + \frac{\left[ \Delta(y_t, y_t^P) - \langle w_t, \Delta\phi_t^{PL} \rangle \right]_+}{\|\Delta\phi_t^{PL}\|_2^2}\, \Delta\phi_t^{PL}
\]


Subgradient: This algorithm, proposed by Nathan Ratliff and Zinkevich
(2007), is an extension of the Greedy Projection algorithm (Zinkevich,
2003) to the case of structured prediction.


CDA: Our newly derived online learning algorithm presented in Algorithm
5.1.


¹ We only used the WSJ part of the test set.
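
To make the two online baselines concrete, the sketch below applies one 1-best MIRA step and one subgradient step to the same hypothetical inputs. The vectors, `delta`, `sigma`, and `t` are made-up values for illustration; the update formulas are the ones stated in this chapter.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mira_update(w_t, dphi, delta):
    # 1-best MIRA: step size equals the hinge loss divided by ||dphi||^2.
    # Assumes dphi is not the zero vector.
    tau = max(0.0, delta - dot(w_t, dphi)) / dot(dphi, dphi)
    return [wi + tau * di for wi, di in zip(w_t, dphi)]

def subgradient_update(w_t, dphi, sigma, t):
    # Subgradient: w_{t+1} = ((t-1)/t) w_t + (1/(sigma t)) dphi,
    # with a learning rate that ignores the loss suffered at this step.
    shrink = (t - 1) / t
    rate = 1.0 / (sigma * t)
    return [shrink * wi + rate * di for wi, di in zip(w_t, dphi)]

# Hypothetical values chosen only for illustration.
w_t = [0.4, -0.1]
dphi = [1.0, -2.0]
delta, sigma, t = 2.0, 0.5, 3

w_mira = mira_update(w_t, dphi, delta)
w_sub = subgradient_update(w_t, dphi, sigma, t)
print(w_mira, dot(w_mira, dphi))  # after the MIRA step the margin equals delta
print(w_sub)
```

The MIRA step moves exactly far enough to satisfy the margin on the current example, which is the loss-driven behaviour discussed in the results, whereas the subgradient step depends only on $\sigma$ and $t$.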

Regarding label loss functions, we use the Hamming (HM) loss described
in section 4.3.2. As mentioned earlier, Hamming loss is a decomposable loss
function, so it can be used with both maximal loss and prediction-based loss.
Since F1 is the standard evaluation metric for the citation segmentation task
on CiteSeer, we also considered the label loss function $(1 - F_1)$ (Joachims,
2005). However, since this loss function is not decomposable, we can only use
it with the prediction-based loss.
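
The two label loss functions can be sketched as follows. This is a simplified illustration that assumes binary labels with a single positive class, which is not how the multi-field citation segmentation task is set up; the function names and the example labels are hypothetical.

```python
def hamming_loss(y_true, y_pred):
    # Decomposable: a sum of independent per-position terms.
    return sum(a != b for a, b in zip(y_true, y_pred))

def one_minus_f1(y_true, y_pred, positive=1):
    # Not decomposable: F1 couples all positions through the TP, FP, and FN counts.
    tp = sum(a == positive and b == positive for a, b in zip(y_true, y_pred))
    fp = sum(a != positive and b == positive for a, b in zip(y_true, y_pred))
    fn = sum(a == positive and b != positive for a, b in zip(y_true, y_pred))
    if 2 * tp + fp + fn == 0:
        return 0.0  # assumption: no positives anywhere counts as a perfect match
    return 1.0 - 2.0 * tp / (2 * tp + fp + fn)

y_true = [1, 0, 1, 1]
y_pred = [1, 1, 0, 1]
print(hamming_loss(y_true, y_pred), one_minus_f1(y_true, y_pred))
```

Because the Hamming loss adds one term per position, it can be folded into the per-atom weights during loss-augmented inference. The $(1 - F_1)$ loss has no such per-position form, which is why it is paired only with the prediction-based loss.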

In training, for online learning algorithms, since the algorithms process
one example at a time, it is feasible to use the exact MPE inference method
based on Integer Linear Programming described in section 4.3.1 on the CiteSeer
and web search query datasets, and Cutting Plane Inference on the CoNLL
2005 dataset. For the offline weight learner MM, we use the approximate
inference algorithm described in section 4.3.1 since it is computationally
intractable to run exact inference for all training examples at once. In testing,
we use MCSAT to compute marginal probabilities for the web search query
dataset since we want to rank the query results, and exact MPE inference on
the other two datasets. For all online learning algorithms, we ran one pass over
the training set and used the average weight vector to predict on the test set.
For CiteSeer, we ran four-fold cross-validation (i.e. leave one topic out). The
parameter $\sigma$ of the Subgradient and CDA algorithms is set based on the performance on
the validation set, except on CiteSeer, where the parameter is set based on training
performance.
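
Averaging the weight vectors does not require storing every iterate. The sketch below keeps a running mean; which iterates enter the average (for example, whether the initial zero vector is included) is not stated in the text, so the choice here is an assumption, and the three sample vectors are hypothetical.

```python
def running_average(avg, w_new, count):
    # Update the mean of the first (count - 1) vectors with the count-th vector.
    return [a + (wn - a) / count for a, wn in zip(avg, w_new)]

# Hypothetical sequence of weight vectors produced by an online learner.
iterates = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]

avg = [0.0, 0.0]
for count, w in enumerate(iterates, start=1):
    avg = running_average(avg, w, count)
print(avg)
```

The averaged vector is then used in place of the final iterate at test time.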

Like previous work, for citation segmentation on CiteSeer, we used F1 at
the token level to measure the performance of each algorithm; for search query
disambiguation, we used MAP (Mean Average Precision), which measures how
close the relevant results are to the top of the ranking; and for semantic role
labeling on CoNLL 2005, we used F1 of the predicted arguments as described
by Carreras and Marquez (2005).
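
As a reminder of how MAP rewards placing relevant results near the top, here is a minimal sketch using the standard definition of average precision. The input format (one list of 0/1 relevance flags per session, in ranked order) and the convention of scoring a session with no relevant result as 0 are assumptions of this example, not details taken from the experiments.

```python
def average_precision(ranked_relevance):
    # ranked_relevance: 0/1 flags in ranked order (1 = the result was clicked).
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0

def mean_average_precision(sessions):
    return sum(average_precision(s) for s in sessions) / len(sessions)

sessions = [[1, 0, 1], [0, 1]]
print(mean_average_precision(sessions))
```

In the first session the relevant results sit at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2; moving the second relevant result up to rank 2 would raise it to 1.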

For testing the statistical significance of differences between the performance of
different algorithms, we use McNemar’s test (Dietterich, 1998) on CiteSeer and
a two-sided paired t-test on the web search query dataset. The significance level was
set to 5% (p-value smaller than 0.05) for both cases.
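
For reference, McNemar’s test compares two classifiers through the examples on which exactly one of them is correct. The sketch below uses the continuity-corrected chi-square form with one degree of freedom; whether the experiments used this exact variant is an assumption, and the per-token correctness flags are made up for illustration.

```python
import math

def mcnemar(correct_a, correct_b):
    """Return (statistic, p_value) for paired per-example correctness flags."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    if only_a + only_b == 0:
        return 0.0, 1.0  # the two systems never disagree
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of chi-square with 1 degree of freedom:
    # P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical flags: A alone is right 15 times, B alone 5 times, both 10 times.
correct_a = [True] * 15 + [False] * 5 + [True] * 10
correct_b = [False] * 15 + [True] * 5 + [True] * 10
stat, p_value = mcnemar(correct_a, correct_b)
print(stat, p_value, p_value < 0.05)
```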

5.3.3  Results  and  Discussion 

Table 5.1 presents the F1 scores of different algorithms on CiteSeer. On
this dataset, the CDA algorithm with maximal loss, CDA-ML-HM, has the
best F1 scores across the four folds. These results are statistically significantly better
than those of the subgradient method, so the aggressive update in the dual results
in better F1 scores. The F1 scores of CDA-ML-HM are slightly higher than
those of 1-best-MIRA, but the difference is not significant. Interestingly, with
the possibility of using exact inference in training, CDA is slightly more
accurate than the batch max-margin algorithm (MM), since the batch learner
can only afford to use approximate inference in training. Other advantages of
online algorithms are in terms of training time and memory. Table 5.2 shows
the average training time of different algorithms on this dataset. All online
learning algorithms took on average about 12-13 minutes for training, while
the batch one took an hour and a half on the same machine. In addition, since
online algorithms process one example at a time, they use much less memory
than batch methods. On the other hand, the running time results also confirm
that the new algorithm, CDA, has the same computational complexity as other
existing online methods. Regarding the comparison between maximal loss and
prediction-based loss, the former is better than the latter on this dataset due
to its more aggressive updates. For the prediction-based loss function, there is not
much difference between using different label loss functions in this case.

Table 5.1: F1 scores on CiteSeer dataset. Highest F1 scores are marked with *.

Algorithms         Constraint   Face      Reasoning   Reinforcement
MM-HM              93.187       92.467    92.581      95.496
1-best-MIRA-HM     90.982       90.598    93.124      97.518
1-best-MIRA-F1     89.764       90.046    93.200      96.841
Subgradient-HM     90.957       89.859    91.505      95.318
CDA-PL-HM          91.245       90.992    92.589      96.516
CDA-PL-F1          91.742       92.368    92.726      96.994
CDA-ML-HM          93.287*      93.204*   93.448*     97.560*

Table 5.2: Average training time on CiteSeer dataset.

Algorithms         Average training time
MM-HM              90.282 min.
1-best-MIRA-HM     11.772 min.
1-best-MIRA-F1     11.768 min.
Subgradient-HM     12.655 min.
CDA-PL-HM          11.869 min.
CDA-PL-F1          11.915 min.
CDA-ML-HM          12.887 min.

Table 5.3: MAP scores on Microsoft search query dataset. Highest MAP scores
are marked with *.

Algorithms         MLN1     MLN2     MLN3
CD                 0.375    0.386    0.366
1-best-MIRA-HM     0.366    0.375    0.379
Subgradient-HM     0.374    0.397*   0.396
CDA-PL-HM          0.382*   0.397*   0.398*
CDA-ML-HM          0.380    0.397*   0.397

Table 5.3 shows the MAP scores of different algorithms on the Microsoft
web search query dataset. The first row in the table is from Mihalkova and
Mooney (2009), who used a variant of the structured
perceptron (Collins, 2002) called Contrastive Divergence (CD) (Hinton, 2002)
to do online weight learning for MLNs. It is clear that the CDA algorithm has
better MAP scores than CD. For this dataset, we were unable to run offline
weight learning since the large amount of training data exhausted memory
during training. 1-best MIRA has the worst MAP scores among the three
max-margin online algorithms on this dataset.
This behavior can be explained as follows. From the update rule of the 1-best
MIRA algorithm, we can see that it aggressively updates the weight vector
according to the loss incurred in each round. Since this dataset is noisy,
this update rule leads to overfitting. This also explains why the subgradient
algorithm has good performance on this data, since its update rule does not
depend on the loss incurred in each round. The MAP scores of the CDA
algorithms are not significantly better than those of the subgradient method,
but their performance is more consistent across the three MLNs. Regarding
the loss function, the MAP scores of CDA-PL and CDA-ML are almost the
same.

Figure 5.1: Learning curve on CoNLL 2005 (curves: CDA-ML-HM, 1-best-MIRA-HM, Subgradient-HM)

Figure 5.1 shows the learning curves of three online learning algorithms,
CDA, 1-best MIRA, and subgradient, on the CoNLL 2005 dataset. In general,
the relative accuracy of the three algorithms is similar to what we have seen on
CiteSeer. CDA outperforms the subgradient method across the whole learning
curve. In particular, at 30,000 training examples, about 1/3 of the training
set, the F1 score of CDA is already better than that of the subgradient
method trained on the whole training set. The performance of CDA and 1-best
MIRA is comparable, except on the early part of the learning
curve (fewer than 10,000 examples), where the F1 scores of CDA are about 1 to
2 percentage points higher than those of 1-best MIRA.

Figure 5.2: F1 scores on noisy CoNLL 2005 (x-axis: percentage of noise; curves: CDA-ML-HM, 1-best-MIRA-HM, Subgradient-HM)

The CoNLL 2005 dataset was carefully annotated by experts (Palmer,
Gildea, & Kingsbury, 2005), which is a time-consuming and expensive process.
Nowadays, a faster and cheaper way to obtain this type of annotation is to use
crowdsourcing services such as Amazon Mechanical Turk,² which make it possible
to assign annotation jobs to thousands of people and get results back in a
few hours (Snow, O’Connor, Jurafsky, & Ng, 2008). However, a downside
of this approach is the large variance in the quality of labels obtained from
different annotators. As a result, there is a lot of noise in the annotated
data. To simulate this type of noisy labeled data, we introduce random noise
to the CoNLL 2005 dataset. At p percent noise, there is probability p that
an argument in a proposition is swapped with another argument in the same
proposition. For example, an argument with role “A0” may be swapped with
an argument with role “A1” and vice versa. Figure 5.2 shows the F1 scores of
the above three online learning algorithms on the noisy CoNLL 2005 dataset at
various levels of noise. In the presence of noise, CDA is the most accurate
and also the most robust to noise among the three algorithms. For 10% noise
and higher, CDA is significantly better than the other two methods. The F1
score of CDA at a noise level of 50% is 8.5% higher than that of 1-best MIRA
and 12.6% higher than that of the subgradient method. On the other hand,
compared with the F1 score on the clean dataset, the F1 score of CDA at 50%
noise drops only 8.4 points, while those of 1-best MIRA and subgradient drop
about 17.6 and 16.1 points respectively. In addition, the F1 score of CDA at 50%
noise is higher than the F1 score of 1-best MIRA at 35% noise and comparable
to the F1 score of the subgradient method at 20% noise.

² https://www.mturk.com/mturk/

In summary, our new online learning algorithm CDA has generally
better accuracy than existing max-margin online methods for structured
prediction such as 1-best MIRA and the subgradient method, which have been
shown to achieve good performance in previous work. In particular, CDA is
significantly better than other methods on noisy datasets.
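
The label-noise simulation used for the noisy CoNLL 2005 experiments can be sketched as below. The text specifies only that, with probability p, an argument is swapped with another argument of the same proposition; how the partner is chosen and the order in which arguments are visited are assumptions of this example, as are the function name and the sample roles.

```python
import random

def add_role_noise(roles, p, rng):
    """Return a noisy copy of the role labels of one proposition.

    roles: list of role labels, one per argument (e.g. ["A0", "A1", "AM-MOD"]).
    p: probability in [0, 1] that a given argument is swapped with another one.
    rng: a random.Random instance, so that runs are reproducible.
    """
    noisy = list(roles)
    n = len(noisy)
    if n < 2:
        return noisy  # nothing to swap with
    for i in range(n):
        if rng.random() < p:
            # Assumption: the partner is drawn uniformly from the other arguments.
            j = rng.choice([k for k in range(n) if k != i])
            noisy[i], noisy[j] = noisy[j], noisy[i]
    return noisy

rng = random.Random(0)
print(add_role_noise(["A0", "AM-MOD", "AM-NEG", "A1", "A2"], 0.5, rng))
```

Swapping keeps the multiset of roles in each proposition unchanged, so the noise changes which argument carries a role rather than which roles appear.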

5.4  Related  Work 

Online learning for max-margin structured prediction has been studied
in several pieces of previous work. In addition to those mentioned earlier,
a family of online algorithms similar to 1-best MIRA, called
passive-aggressive algorithms, was presented in (Crammer, Dekel, Keshet,
Shalev-Shwartz, & Singer, 2006). Another piece of related work is the exponentiated
gradient algorithm (Bartlett, Collins, Taskar, & McAllester, 2005; Collins
et al., 2008), which also performs updates based on the dual of the primal problem.
However, the dual problem in (Bartlett et al., 2005; Collins et al., 2008)
is more complicated and expensive to solve since it was derived based on the
max-margin loss, $l_{MM}$. As a result, to efficiently solve the problem, the authors
assume that each label $y$ is a set of parts and both the joint feature and the
label loss function can be decomposed into a sum over those for the individual
parts. Even under this assumption, efficiently computing the marginal values
of the part variables is still a challenging problem.

In the context of online weight learning for MLNs, one related algorithm is SampleRank (Culotta, 2008), which uses a sampling algorithm to generate samples from a given training example and updates the weight vector whenever it misranks a pair of samples. So, unlike traditional online learning algorithms that perform one update per example, SampleRank performs multiple updates per example. However, the performance of SampleRank depends heavily on the sampling algorithm, and which sampling algorithms are best is an open research question.

The issue of prediction-based loss versus maximal loss has been discussed previously (Crammer et al., 2006; Shalev-Shwartz, 2007), but no experiments have been conducted to compare them on real-world datasets.

5.5  Chapter  Summary 

We have presented a comprehensive study of online weight learning for MLNs. Based on the primal-dual framework, we derived CDA, a new online algorithm for structured prediction, applied it to learn weights for MLNs, and compared it to existing online methods on three large, real-world datasets. Our new algorithm generally achieved better accuracy than existing online methods. In particular, it is more accurate and more robust when the training data are noisy.


Chapter 6


Online  Structure  Learning  for  MLNs 

6.1  Introduction 

In  the  previous  chapter,  we  derived  a  new  online  max-margin  weight 
learning  algorithm  for  MLNs.  However,  like  other  existing  online  algorithms, 
the  algorithm  assumes  the  input  MLN’s  structure  is  complete  and  only  updates 
the  weights.  In  practice,  the  input  structure  is  usually  incomplete,  so  it  should 
also  be  updated.  To  address  this  issue,  in  this  chapter,  we  propose  a  new 
algorithm  that  performs  both  online  structure  and  parameter  learning. 

The remainder of the chapter is organized as follows. Section 6.2 provides some background on the field segmentation task. Section 6.3 presents the proposed algorithm, OSL. Section 6.4 reports the experimental evaluation on two real-world datasets. Section 6.5 discusses related work, and Section 6.6 summarizes the chapter.

6.2  Task 

In this chapter, we focus on an information extraction task called field segmentation, which is a generalized version of the citation segmentation task described in previous chapters. Field segmentation is an instance of a structured prediction problem in which the data contain many examples, such as a corpus of documents. In this task, a document is represented as a sequence of tokens, and the goal is to segment the document into fields (i.e. to label each token in the document with a field label). For example, in segmenting advertisements for apartment rentals (Grenager et al., 2005), the goal is to segment each advertisement into fields such as Features, Neighborhood, Rent, Contact, and so on. Below are descriptions of some key predicates in this domain:

• Token(string, position, docID): the token at a particular position in a document, such as Token(Entirely, PA, Ad001)

• Next(position, position): the latter position immediately follows the former position, such as Next(P01, P02)

• LessThan(position, position): the former position appears before the latter position, such as LessThan(P01, P05)

• InField(field, position, docID): the field label of the token at a particular position in a document, such as InField(Features, PA, Ad001)

Only  InField  is  the  target  predicate  and  the  rest  are  evidence  predicates. 
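
As a rough illustration of this representation, the sketch below turns one tokenized document into ground atoms for the four predicates. The helper name `document_to_atoms`, the tuple encoding of atoms, and the `P01`-style position constants are assumptions made for this example; they are not taken from Alchemy or from the implementation used in this thesis.

```python
from typing import List, Set, Tuple

Atom = Tuple[str, ...]  # (predicate, arg1, arg2, ...)


def document_to_atoms(tokens: List[str], labels: List[str], doc_id: str):
    """Return (evidence atoms, target atoms) for one labeled document."""
    positions = [f"P{i + 1:02d}" for i in range(len(tokens))]
    evidence: Set[Atom] = set()
    target: Set[Atom] = set()
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        evidence.add(("Token", tok, positions[i], doc_id))
        target.add(("InField", lab, positions[i], doc_id))
        if i + 1 < len(tokens):
            # Next links each position to its immediate successor.
            evidence.add(("Next", positions[i], positions[i + 1]))
        for j in range(i + 1, len(tokens)):
            # LessThan holds for every earlier/later pair of positions.
            evidence.add(("LessThan", positions[i], positions[j]))
    return evidence, target


evidence, target = document_to_atoms(
    ["Entirely", "remodeled", "kitchen"], ["Features"] * 3, "Ad001"
)
```

Only the `InField` atoms end up in `target`; everything else is evidence, mirroring the split between target and evidence predicates described above.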

6.3 Online Max-Margin Structure and Parameter Learning

In this section, we describe OSL, the new online max-margin learning algorithm for updating both the structure and parameters of an MLN. In each step, whenever the model makes wrong predictions on a given example, OSL uses the wrongly predicted atoms to find new clauses that discriminate the ground-truth possible world from the predicted one, then uses an adaptive subgradient method with $l_1$-regularization to update the weights of both old and new clauses. Algorithm 6.1 gives the pseudocode for OSL. Lines 3 to 20 are pseudocode for structure learning and lines 21 to 35 are pseudocode for parameter learning.

6.3.1  Online  Max-Margin  Structure  Learning  with  Mode-Guided 
Relational  Pathfinding 

Most existing structure learning algorithms for MLNs only consider ground-truth possible worlds and search for clauses that improve the likelihood of those possible worlds. However, these approaches may spend a lot of time searching over insignificant clauses that are likely true in most possible worlds. Therefore, instead of only considering ground-truth possible worlds, OSL also takes into account the predicted possible worlds, which are the most probable possible worlds predicted by the model. At each step, if the predicted possible world is different from the ground-truth one, then OSL focuses on where the two possible worlds differ and searches for clauses that differentiate them. This is related to the idea of using implicit negative examples in (Zelle, Thompson, Califf, & Mooney, 1995). In this case, each ground-truth possible world plays the role of a positive example in traditional ILP. Making a closed world assumption (Genesereth & Nilsson, 1987), any possible world that differs from the ground-truth possible world is incorrect and can be considered as a negative example (the predicted possible world in this case). In addition, this follows the max-margin training criterion (section 2.4), which discriminates the true label (the ground-truth possible world) from the closest incorrect one (the predicted possible world).

At each time step $t$, OSL receives an example $x_t$, produces the predicted label $y_t^P = \arg\max_{y \in \mathcal{Y}} \langle \mathbf{w}_{\mathcal{C}}, \mathbf{n}_{\mathcal{C}}(x_t, y) \rangle$, then receives the true label $y_t$. Given $y_t$ and $y_t^P$, in order to find clauses that separate $y_t$ from $y_t^P$, OSL first finds the atoms that are in $y_t$ but not in $y_t^P$, $\Delta y_t = y_t \setminus y_t^P$. Then OSL searches the ground-truth possible world $(x_t, y_t)$ for clauses that are specific to the true ground atoms in $\Delta y_t$.
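
The set difference above can be sketched in a few lines, assuming each label is represented as the set of its true ground atoms. That representation, the tuple encoding of atoms, and the function name `missed_true_atoms` are assumptions made for this example.

```python
def missed_true_atoms(y_true, y_pred):
    """Atoms that are true in the ground truth but absent from the prediction."""
    return set(y_true) - set(y_pred)


y_true = {
    ("InField", "Size", "P29", "Ad001"),
    ("InField", "Size", "P30", "Ad001"),
}
y_pred = {
    ("InField", "Size", "P29", "Ad001"),
    ("InField", "Rent", "P30", "Ad001"),  # wrong field predicted at P30
}
delta_y = missed_true_atoms(y_true, y_pred)
```

Here `delta_y` contains only the `InField(Size, P30, Ad001)` atom, so the search for new clauses would start from that atom alone.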

A simple way to find useful clauses specific to a set of atoms is to use relational pathfinding (Richards & Mooney, 1992), which considers a relational example as a hypergraph with constants as nodes and true ground atoms as hyperedges connecting the nodes that are its arguments, and searches in the hypergraph for paths that connect the arguments of an input literal. A path of hyperedges corresponds to a conjunction of true ground atoms connected by their arguments and can be generalized into a first-order clause by variabilizing their arguments. Starting from a given atom, relational pathfinding searches for all paths connecting the arguments of the given atom. Therefore, relational pathfinding may be very slow or even intractable when there are a large (or exponential) number of paths. To speed up relational pathfinding, we use mode declarations (Muggleton, 1995) to constrain the search for paths. As defined in (Muggleton, 1995), mode declarations are a form of language bias to constrain the search for definite clauses. Since our goal is to use mode declarations for constraining the search space of paths, we introduce a new mode declaration for paths: modep(r, p). It has two components: a recall number r, which is a positive integer, and an atom p whose arguments are place-markers. A place-marker is either ‘+’ (input), ‘-’ (output), or ‘.’ (don’t explore). The recall number r limits the number of appearances of the predicate p in a path to r. The place-markers restrict the search of relational pathfinding: only paths connecting ‘input’ or ‘output’ nodes are considered. A ground atom can only be added to a path if one of its arguments has previously appeared as an ‘input’ or ‘output’ argument in the path and all of its ‘input’ arguments are ‘output’ arguments of previous atoms. Here are some examples of mode declarations for paths:

modep(2, Token(., +, .))
modep(1, Next(-, -))
modep(2, InField(., -, .))

The above mode declarations require that a legal path contains at most two ground atoms of each of the predicates Token and InField and at most one ground atom of the predicate Next. Moreover, the second argument of Token is an ‘input’ argument; the second argument of InField and all arguments of Next are ‘output’ arguments. Note that, in this case, all ‘input’ and ‘output’ arguments are of type position. These ‘input’ and ‘output’ modes require that the position constants in atoms of Token must have appeared in some previous atoms of Next or InField in a path. From the graphical model perspective, these mode declarations restrict the search space to linear chain CRFs (Sutton & McCallum, 2007) since they constrain the search to paths connecting ground atoms of two consecutive tokens. It is easy to modify the mode declarations to search for more complicated structures. For example, if we increase the recall number of Next to 2 and the recall number of InField to 3, then the search space is constrained to second-order CRFs since the search is restricted to paths connecting ground atoms of three consecutive tokens. If we add a new mode declaration modep(1, LessThan(-, -)) for the predicate LessThan, then the search space becomes skip-chain CRFs (Sutton & McCallum, 2007). Algorithm 6.2 presents the pseudocode for efficiently constructing a hypergraph based on mode declarations by only constructing the hypergraph corresponding to input and output nodes. Algorithm 6.3 gives the pseudocode for mode-guided relational pathfinding, ModeGuidedFindPaths, on the constructed hypergraph. It is an extension of a variant¹ of relational pathfinding presented in (Kok & Domingos, 2009). Starting from each true ground atom $r(c_1, \ldots, c_r) \in \Delta y_t$, it recursively adds to the path ground atoms or hyperedges that satisfy the mode declarations. Its search terminates when the path reaches a specified maximum length or when no new hyperedge can be added. The algorithm stores all the paths encountered during the search. Below is an example path found by the algorithm:

{InField(Size, P29, Ad001), Token(And, P29, Ad001), Next(P29, P30),
Token(Spacious, P30, Ad001), InField(Size, P30, Ad001)}
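
The following is a minimal Python sketch of mode-guided pathfinding in the spirit of Algorithms 6.2 and 6.3, not a transcription of them. It assumes atoms are tuples, mode declarations are stored in a dictionary `MODES` mapping each predicate to its recall number and place-markers, and a path is recorded as a frozenset of atoms. All function names are hypothetical, and the duplicate check is applied to the extended path rather than to the current one.

```python
from collections import defaultdict

# Hypothetical encoding: predicate -> (recall, place-marker per argument).
MODES = {
    "Token": (2, (".", "+", ".")),
    "Next": (1, ("-", "-")),
    "InField": (2, (".", "-", ".")),
}


def io_constants(atom, modes):
    """Constants of `atom` that sit at '+' or '-' argument positions."""
    return {a for a, m in zip(atom[1:], modes[atom[0]][1]) if m in ("+", "-")}


def create_hypergraph(atoms, modes):
    """Index each input/output constant to the true atoms that contain it."""
    H = defaultdict(set)
    for atom in atoms:
        if atom[0] in modes:
            for c in io_constants(atom, modes):
                H[c].add(atom)
    return H


def can_be_added(atom, path, modes):
    """Check the recall limit and that every '+' argument is an earlier '-'."""
    pred = atom[0]
    if atom in path or pred not in modes:
        return False
    recall, markers = modes[pred]
    if sum(1 for a in path if a[0] == pred) >= recall:
        return False
    outputs = {arg for a in path
               for arg, m in zip(a[1:], modes[a[0]][1]) if m == "-"}
    return all(arg in outputs for arg, m in zip(atom[1:], markers) if m == "+")


def find_paths(path, V, H, modes, max_len, paths):
    """Depth-first extension of `path`; every path reached is stored in `paths`."""
    if len(path) >= max_len:
        return
    for c in V:
        for atom in H.get(c, ()):
            if not can_be_added(atom, path, modes):
                continue
            new_path = path + (atom,)
            key = frozenset(new_path)
            if key in paths:  # this set of hyperedges was already explored
                continue
            paths.add(key)
            find_paths(new_path, V | io_constants(atom, modes),
                       H, modes, max_len, paths)


world = {
    ("Token", "And", "P29", "Ad001"),
    ("Token", "Spacious", "P30", "Ad001"),
    ("Next", "P29", "P30"),
    ("InField", "Size", "P29", "Ad001"),
    ("InField", "Size", "P30", "Ad001"),
}
start = ("InField", "Size", "P29", "Ad001")
H = create_hypergraph(world, MODES)
paths = set()
find_paths((start,), frozenset(io_constants(start, MODES)), H, MODES, 5, paths)
```

With a maximum length of 5, the five-atom example path above is among the stored paths, while a set such as {InField(Size, P29, Ad001), Token(Spacious, P30, Ad001)} is never generated because P30 has not yet appeared as an output argument.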

¹ In this variant, a path does not need to connect arguments of the input atom. The only requirement is that any two consecutive atoms in a path must share at least one argument.

A standard way to generalize paths into first-order clauses is to replace each constant $c_i$ in a conjunction with a variable $v_i$. However, for many tasks such as field segmentation, it is critical to have clauses that are specific to a particular constant. In order to create clauses with constants, we introduce mode declarations for creating clauses: modec(p). This mode declaration has only one component, which is an atom p whose arguments are either ‘c’ (constant) or ‘v’ (variable). Below are some examples of mode declarations for creating clauses:

modec(Token(c, v, v))
modec(Next(v, v))
modec(InField(c, v, v))

Based on these mode declarations, OSL variabilizes all constants in a conjunction except those that are declared as constants. Then OSL converts the conjunction of positive literals to clausal form since this is the form used in Alchemy. In MLNs, a conjunction of positive literals with weight w is equivalent to a clause of negative literals with weight −w. Previous work (Kok & Domingos, 2009, 2010) found that it is also useful to add other variants of the clause by flipping the signs of some literals in the clause. Currently, we only add one variant: a Horn version of the clause, obtained by flipping only the first literal, the one for which the model made a wrong prediction. In summary, for each path, OSL creates two types of clauses: one with all negative literals and one in which only the first literal is positive. For example, from the sample path above, OSL creates the following two clauses:

¬InField(Size, p1, a) ∨ ¬Token(And, p1, a) ∨ ¬Next(p1, p2) ∨
¬Token(Spacious, p2, a) ∨ ¬InField(Size, p2, a)

InField(Size, p1, a) ∨ ¬Token(And, p1, a) ∨ ¬Next(p1, p2) ∨
¬Token(Spacious, p2, a) ∨ ¬InField(Size, p2, a)
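
The clause-creation step can be sketched as follows, assuming a path is given as an ordered list of atom tuples whose first element is the wrongly predicted atom. The function name `create_clauses`, the dictionary encoding of the modec declarations, the generated variable names, and the textual output format (`!` for negation and `v` for disjunction) are assumptions chosen for this example.

```python
# Hypothetical encoding of the modec declarations: 'c' keeps the constant,
# 'v' replaces it with a variable.
CONST_MODES = {
    "Token": ("c", "v", "v"),
    "Next": ("v", "v"),
    "InField": ("c", "v", "v"),
}


def create_clauses(path, const_modes):
    """Return (all-negative clause, Horn clause) for one path."""
    var_of = {}  # constant -> variable name, shared across the whole path
    literals = []
    for atom in path:
        pred, args = atom[0], atom[1:]
        new_args = []
        for arg, marker in zip(args, const_modes[pred]):
            if marker == "c":
                new_args.append(arg)
            else:
                if arg not in var_of:
                    var_of[arg] = f"v{len(var_of) + 1}"
                new_args.append(var_of[arg])
        literals.append(f"{pred}({','.join(new_args)})")
    all_negative = " v ".join("!" + lit for lit in literals)
    # Horn variant: only the first literal (the mispredicted atom) is positive.
    horn = " v ".join([literals[0]] + ["!" + lit for lit in literals[1:]])
    return all_negative, horn


path = [
    ("InField", "Size", "P29", "Ad001"),
    ("Token", "And", "P29", "Ad001"),
    ("Next", "P29", "P30"),
    ("Token", "Spacious", "P30", "Ad001"),
    ("InField", "Size", "P30", "Ad001"),
]
all_negative, horn = create_clauses(path, CONST_MODES)
```

Up to variable naming, the two returned strings correspond to the two clauses listed above.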


Finally, for each new clause $c$, OSL computes the difference in the number of true groundings of $c$ between the ground-truth possible world $(x_t, y_t)$ and the predicted possible world $(x_t, y_t^P)$, $\Delta n_c = n_c(x_t, y_t) - n_c(x_t, y_t^P)$. Then, only clauses whose difference in the number of true groundings is greater than or equal to a predefined threshold minCountDiff are added to the existing MLN. The smaller the value of minCountDiff, the more clauses are added to the existing MLN at each step.
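
The selection step can be illustrated as below, assuming the numbers of true groundings in the two possible worlds have already been computed (in practice that counting is the expensive part and is not shown). The function name `select_clauses` and the dictionary inputs are hypothetical.

```python
def select_clauses(counts_true, counts_pred, min_count_diff):
    """Keep clauses whose grounding-count difference reaches the threshold."""
    return [
        clause
        for clause in counts_true
        if counts_true[clause] - counts_pred.get(clause, 0) >= min_count_diff
    ]


# Toy counts n_c(x_t, y_t) and n_c(x_t, y_t^P) for three candidate clauses.
counts_true = {"c1": 3, "c2": 1, "c3": 2}
counts_pred = {"c1": 1, "c2": 1, "c3": 1}

selected_m1 = select_clauses(counts_true, counts_pred, 1)
selected_m2 = select_clauses(counts_true, counts_pred, 2)
```

With a threshold of 1, both `c1` and `c3` are kept; raising the threshold to 2 keeps only `c1`, which illustrates why a larger minCountDiff yields fewer new clauses per step.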


6.3.2 Online Max-Margin $l_1$-regularized Weight Learning


The above online structure learner may introduce a lot of new clauses in each step, and some of them may not be useful in the long run. To address this issue, as in section 3.2.2, we use $l_1$-regularization, but in an online setting. We employ a state-of-the-art online $l_1$-regularization method, ADAGRAD_FB, which is an $l_1$-regularized adaptive subgradient method using the composite mirror-descent update (Duchi, Hazan, & Singer, 2010). At each time step $t$, it updates the weight vector as follows:


\[
w_{t+1,i} = \operatorname{sign}\left( w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \right) \left[ \left| w_{t,i} - \frac{\eta}{H_{t,ii}} g_{t,i} \right| - \frac{\lambda \eta}{H_{t,ii}} \right]_+ \tag{6.1}
\]

Algorithm 6.1 OSL

Input: C: initial clause set (can be empty)
       mode: mode declaration for each predicate
       maxLen: maximum number of hyperedges in a path
       minCountDiff: minimum difference in the number of true groundings for selecting new clauses
       λ, η, δ: parameters for the l1-regularization adaptive subgradient method
       ρ(y, y'): label loss function
Note:  Index H maps from each node γ_i to the set of hyperedges r(γ_1, ..., γ_i, ..., γ_n) containing γ_i
       Paths is a set of paths, each path is a set of hyperedges
 1: Initialize: w_C = 0, g_C = 0, n_c = |C|
 2: for t = 1 to T do
 3:   Receive an instance x_t
 4:   Predict y_t^P = argmax_{y ∈ Y} ⟨w_C, n_C(x_t, y)⟩
 5:   Receive the correct target y_t
 6:   Compute Δy_t = y_t \ y_t^P
 7:   if Δy_t ≠ ∅ then
 8:     H = CreateHG((x_t, y_t), mode)
 9:     Paths = ∅
10:     for each true atom r(c_1, ..., c_r) ∈ Δy_t do
11:       V = ∅
12:       for each c_i ∈ {c_1, ..., c_r} do
13:         if isInputOrOutputVar(c_i, mode) then
14:           V = V ∪ {c_i}
15:         end if
16:       end for
17:       ModeGuidedFindPaths({r(c_1, ..., c_r)}, V, H, mode, maxLen, Paths)
18:     end for
19:   end if
20:   C_new = CreateClauses(C, Paths, mode)
21:   Compute Δn_C, Δn_Cnew:
22:     Δn_C = n_C(x_t, y_t) − n_C(x_t, y_t^P)
23:     Δn_Cnew = n_Cnew(x_t, y_t) − n_Cnew(x_t, y_t^P)
24:   for i = 1 to |C| do
25:     g_C,i = g_C,i + Δn_C,i * Δn_C,i
26:     w_C,i = sign(w_C,i + η/(δ + √g_C,i) Δn_C,i) [ |w_C,i + η/(δ + √g_C,i) Δn_C,i| − λη/(δ + √g_C,i) ]_+
27:   end for
28:   for i = 1 to |C_new| do
29:     if Δn_Cnew,i ≥ minCountDiff then
30:       C = C ∪ C_new,i
31:       n_c = n_c + 1
32:       g_C,n_c = Δn_Cnew,i * Δn_Cnew,i
33:       w_C,n_c = η/(δ + √g_C,n_c) [Δn_Cnew,i − λ]_+
34:     end if
35:   end for
36: end for


Algorithm 6.2 CreateHG(D, mode)

Input: D: a relational example
       mode: mode declaration file
 1: for each constant c in D do
 2:   H[c] = ∅
 3: end for
 4: for each true ground atom r(c_1, ..., c_r) ∈ D do
 5:   for each constant c_i ∈ {c_1, ..., c_r} do
 6:     if isInputOrOutputVar(c_i, mode) then
 7:       H[c_i] = H[c_i] ∪ {r(c_1, ..., c_r)}
 8:     end if
 9:   end for
10: end for
11: return H


Algorithm 6.3 ModeGuidedFindPaths(CurrPath, V, H, mode, maxLen, Paths)
 1: if |CurrPath| < maxLen then
 2:   for each constant c ∈ V do
 3:     for each r(c_1, ..., c_r) ∈ H[c] do
 4:       if canBeAdded(r(c_1, ..., c_r), CurrPath, mode) == success then
 5:         if CurrPath ∉ Paths then
 6:           CurrPath = CurrPath ∪ {r(c_1, ..., c_r)}
 7:           Paths = Paths ∪ {CurrPath}
 8:           V' = ∅
 9:           for each c_i ∈ {c_1, ..., c_r} do
10:             if c_i ∉ V and isInputOrOutputVar(c_i, mode) then
11:               V = V ∪ {c_i}
12:               V' = V' ∪ {c_i}
13:             end if
14:           end for
15:           ModeGuidedFindPaths(CurrPath, V, H, mode, maxLen, Paths)
16:           CurrPath = CurrPath \ {r(c_1, ..., c_r)}
17:           V = V \ V'
18:         end if
19:       end if
20:     end for
21:   end for
22: end if




where $\lambda$ is the regularization parameter, $\eta$ is the learning rate, $\mathbf{g}_t$ is the subgradient of the loss function at step $t$, and $H_{t,ii} = \delta + \|g_{1:t,i}\|_2 = \delta + \sqrt{\sum_{j=1}^{t} (g_{j,i})^2}$ ($\delta > 0$). Note that ADAGRAD_FB assigns a different step size to each component of the weight vector. Thus, besides the weights, ADAGRAD_FB also needs to retain the sum of the squared subgradients of each component.

From equation 6.1, we can see that if a clause is not relevant to the current example (i.e. $g_{t,i} = 0$), then ADAGRAD_FB discounts its weight by $\lambda\eta / H_{t,ii}$. Thus, irrelevant clauses will be zeroed out in the long run.
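
A minimal sketch of one ADAGRAD_FB step, written directly from equation 6.1, is shown below. The function name `adagrad_fb_step` and the list-based representation are assumptions for this example; the default parameter values are the ones reported in section 6.4.3.

```python
import math


def adagrad_fb_step(w, g_sq_sum, grad, lam=0.001, eta=1.0, delta=1.0):
    """Apply equation 6.1 to every component.

    w        -- current weights
    g_sq_sum -- running sum of squared subgradients per component
    grad     -- subgradient of the loss at this step
    Returns the updated (weights, squared-subgradient sums).
    """
    new_w, new_sq = [], []
    for w_i, s_i, g_i in zip(w, g_sq_sum, grad):
        s_i = s_i + g_i * g_i
        H = delta + math.sqrt(s_i)              # per-component H_{t,ii}
        z = w_i - (eta / H) * g_i               # plain subgradient step
        shrunk = max(abs(z) - lam * eta / H, 0.0)  # l1 soft-thresholding
        new_w.append(math.copysign(shrunk, z))
        new_sq.append(s_i)
    return new_w, new_sq


# Component 0 receives a subgradient; component 1 is irrelevant (g = 0).
w_next, sq_next = adagrad_fb_step([0.5, 0.0005], [0.0, 0.0], [-2.0, 0.0])
```

In this toy call the second component has zero subgradient and a weight smaller than $\lambda\eta / H_{t,ii}$, so it is truncated to zero, which is the shrinking behaviour described above.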

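Regarding the loss function, we use the prediction-based loss function $l_{PL}$ described in section 5.2:

\[
l_{PL}(\mathbf{w}_{\mathcal{C}}, (x_t, y_t)) = \left[ \rho(y_t, y_t^P) - \langle \mathbf{w}_{\mathcal{C}}, \mathbf{n}_{\mathcal{C}}(x_t, y_t) - \mathbf{n}_{\mathcal{C}}(x_t, y_t^P) \rangle \right]_+
\]

The subgradient of $l_{PL}$ is:

\[
\mathbf{g}_{PL} = \mathbf{n}_{\mathcal{C}}(x_t, y_t^P) - \mathbf{n}_{\mathcal{C}}(x_t, y_t) = -\left[ \mathbf{n}_{\mathcal{C}}(x_t, y_t) - \mathbf{n}_{\mathcal{C}}(x_t, y_t^P) \right] = -\Delta \mathbf{n}_{\mathcal{C}}
\]

Substituting the subgradient into equation 6.1, we obtain the following formulae for updating the weights of old clauses:

\[
g_{\mathcal{C},i} \leftarrow g_{\mathcal{C},i} + (\Delta n_{\mathcal{C},i})^2
\]

\[
w_{\mathcal{C},i} \leftarrow \operatorname{sign}\left( w_{\mathcal{C},i} + \frac{\eta}{\delta + \sqrt{g_{\mathcal{C},i}}} \Delta n_{\mathcal{C},i} \right) \left[ \left| w_{\mathcal{C},i} + \frac{\eta}{\delta + \sqrt{g_{\mathcal{C},i}}} \Delta n_{\mathcal{C},i} \right| - \frac{\lambda \eta}{\delta + \sqrt{g_{\mathcal{C},i}}} \right]_+
\]

For new clauses, the update formulae are simpler since all the previous weights and gradients are zero:

\[
g_{\mathcal{C},n_c} = (\Delta n_{\mathcal{C},n_c})^2
\]

\[
w_{\mathcal{C},n_c} = \frac{\eta}{\delta + \sqrt{g_{\mathcal{C},n_c}}} \left[ \Delta n_{\mathcal{C},n_c} - \lambda \right]_+
\]

Lines 24 to 27 in Algorithm 6.1 are the pseudocode for updating the weights of existing clauses, and lines 28 to 35 are the pseudocode for selecting and setting weights for new clauses.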
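
As a check on the simpler update for new clauses, the short derivation below substitutes a zero previous weight and a zero previous sum of squared subgradients into equation 6.1. It assumes $\Delta n_{\mathcal{C},n_c} > 0$, which holds for every selected clause whenever minCountDiff is at least 1; the shorthand $H$ and $\Delta n$ is introduced only for this derivation.

```latex
% Shorthand (assumed for this derivation only):
%   \Delta n = \Delta n_{\mathcal{C},n_c} > 0, previous weight 0, subgradient g = -\Delta n
\begin{align*}
H &= \delta + \sqrt{g_{\mathcal{C},n_c}}
   = \delta + \sqrt{(\Delta n)^2}
   = \delta + \Delta n, \\
w_{\mathcal{C},n_c}
  &= \operatorname{sign}\left(0 + \frac{\eta}{H}\,\Delta n\right)
     \left[\left|\frac{\eta}{H}\,\Delta n\right| - \frac{\lambda\eta}{H}\right]_+
   = \frac{\eta}{H}\left[\Delta n - \lambda\right]_+ .
\end{align*}
```

For instance, with $\lambda = 0.001$, $\eta = 1$, $\delta = 1$ and $\Delta n = 2$, this gives $H = 3$ and an initial weight of $1.999 / 3 \approx 0.666$.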

6.4  Experimental  Evaluation 

In this section, we conduct experiments to answer the following questions:

1. Starting from a given MLN, does OSL find new useful clauses that improve the predictive accuracy?

2. How well does OSL perform when starting from an empty knowledge base?

3. How does OSL compare to LSM, the state-of-the-art structure learner for MLNs (Kok & Domingos, 2010)?

6.4.1  Data 

We ran experiments on two real-world datasets for field segmentation: CiteSeer, described in section 4.4.1, and the advertisements dataset Craigslist (Grenager et al., 2005).

The Craigslist dataset² consists of advertisements for apartment rentals posted on Craigslist. There are 8,767 ads in the dataset, but only 302 of them were labeled with 11 fields: Available, Address, Contact, Features, Neighborhood, Photos, Rent, Restrictions, Roommates, Size, and Utilities. The labeled ads are divided into 3 disjoint sets: training, development, and test. The sets contain 102, 100, and 100 ads respectively. We preprocessed the data using regular expressions to recognize numbers, dates, times, phone numbers, URLs, and email addresses.

² http://nlp.stanford.edu/~grenager/data/unsupie.tgz

6.4.2  Input  MLNs 

A standard model for sequence labeling tasks such as field segmentation is a linear chain CRF (Lafferty et al., 2001). Thus, we use a linear chain CRF as the input MLN. The following MLN, named LC_0, encodes a simple linear chain CRF that only uses the current words as features:

Token(+t, p, c) ⇒ InField(+f, p, c)
Next(p1, p2) ∧ InField(+f1, p1, c) ⇒ InField(+f2, p2, c)
InField(f1, p, c) ∧ (f1 != f2) ⇒ ¬InField(f2, p, c).

The plus notation indicates that the MLN contains an instance of the first clause for each (token, field) pair, and an instance of the second clause for each pair of fields. Thus, the first set of rules captures the correlation between tokens and fields, and the second set of rules represents the transitions between fields. The third rule enforces the constraint that the token at a position p can be part of at most one field.

For CiteSeer, there is also an existing MLN, previously mentioned in section 4.4.1, which is called the isolated segmentation model (ISM). ISM is also a linear chain CRF but has more features than the simple linear chain CRF above. Like LC_0, ISM has rules that correlate the current words with field labels. For transition rules, ISM only captures transitions within fields and also takes into account punctuation as field boundaries:

Next(p1, p2) ∧ ¬HasPunc(p1, c) ∧ InField(+f, p1, c) ⇒ InField(+f, p2, c)

In addition, ISM contains rules that are specific to the citation domain, such as “the first two positions of a citation are usually in the author field” and “initials tend to appear in either the author or the venue field”. Most of those rules are features corresponding to words that appear before or after the current tokens.
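
To make the plus notation used in LC_0 and ISM concrete, the sketch below expands a clause template into one clause per combination of values of its ‘+’ variables. The function `expand_plus_clause`, the ASCII clause syntax, and the toy token and field constants are assumptions for this example; the naive string replacement also assumes that no ‘+’ variable name is a prefix of another.

```python
from itertools import product


def expand_plus_clause(template, plus_domains):
    """Return one clause string per combination of '+' variable values.

    template     -- clause text in which '+name' marks a per-constant variable
    plus_domains -- mapping from each '+' variable name to its constants
    """
    names = list(plus_domains)
    clauses = []
    for combo in product(*(plus_domains[n] for n in names)):
        clause = template
        for name, value in zip(names, combo):
            clause = clause.replace("+" + name, value)
        clauses.append(clause)
    return clauses


tokens = ["Tand", "Tspacious", "Tin"]  # toy vocabulary
fields = ["Fsize", "Frent"]            # toy field labels

unigram = expand_plus_clause(
    "Token(+t,p,c) => InField(+f,p,c)", {"t": tokens, "f": fields}
)
transition = expand_plus_clause(
    "Next(p1,p2) ^ InField(+f1,p1,c) => InField(+f2,p2,c)",
    {"f1": fields, "f2": fields},
)
```

With 3 tokens and 2 fields, the first template yields 6 weighted clauses and the second yields 4, one per (token, field) pair and one per ordered pair of fields respectively.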

For Craigslist, previous work (Grenager et al., 2005) found that it is useful to only capture the transitions within fields and to take into account the field boundaries, so we created a version of ISM for it by removing all clauses that are specific to the citation domain. Thus, the ISM MLN for Craigslist is a revised version of the LC_0 MLN. Therefore, we only ran experiments with ISM on Craigslist.

6.4.3  Methodology 

To  answer  the  questions  above,  we  ran  experiments  with  the  following 
systems: 




ADAGRAD_FB-LC_0: Uses ADAGRAD_FB to learn weights for the LC_0 MLN.

OSL-M1-LC_0: Starting from the LC_0 MLN, this system runs a slow version of OSL where the parameter minCountDiff is set to 1, i.e. all clauses whose number of true groundings in the true possible worlds is greater than that in the predicted possible worlds are selected.

OSL-M2-LC_0: Starting from the LC_0 MLN, this system runs a faster version of OSL where the parameter minCountDiff is set to 2.

ADAGRAD_FB-ISM: Uses ADAGRAD_FB to learn weights for the ISM MLN.

OSL-M1-ISM: Like OSL-M1-LC_0, but starting from the ISM MLN.

OSL-M2-ISM: Like OSL-M2-LC_0, but starting from the ISM MLN.

OSL-M1-Empty: Like OSL-M1-LC_0, but starting from an empty MLN.

OSL-M2-Empty: Like OSL-M2-LC_0, but starting from an empty MLN.

Regarding  label  loss  functions,  we  use  Hamming  (HM)  loss  described  in  section 
4.3.2. 
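
Section 4.3.2 is not reproduced here, so the sketch below simply assumes that the Hamming loss counts the ground atoms whose truth values differ between the true and the predicted possible worlds, with each world represented as the set of its true atoms. The function name `hamming_loss` is hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Number of ground atoms whose truth value differs between the two labels."""
    return len(set(y_true) ^ set(y_pred))


y_true = {
    ("InField", "Size", "P29", "Ad001"),
    ("InField", "Size", "P30", "Ad001"),
}
y_pred = {
    ("InField", "Size", "P29", "Ad001"),
    ("InField", "Rent", "P30", "Ad001"),
}
loss = hamming_loss(y_true, y_pred)
```

Under this assumption, mislabeling a single token contributes 2 to the loss: one true atom that was missed and one false atom that was predicted.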

For inference in training and testing, we used the exact MPE inference method based on Integer Linear Programming described in section 4.3.1. For all systems, we ran one pass over the training set and used the average weight vector to predict on the test set. For Craigslist, we used the original split for training and test. For CiteSeer, we ran four-fold cross-validation (i.e. leave one topic out). The parameters λ, η, δ of ADAGRAD_FB were set to 0.001, 1, and 1 respectively. For OSL, the mode declarations were set to constrain the search space of relational pathfinding to linear chain CRFs in order to make exact inference in training feasible³; the maximum path length maxLen was set to 4; and the parameters λ, η, δ were set to the same values as in ADAGRAD_FB. All the parameters were set based on the performance on the Craigslist development set. We used the same parameter values on CiteSeer.

Like previous work, we used F1 to measure the performance of each system.

³ We did try to search a more complex space such as second-order CRFs, but training took much longer with minimal improvement in the F1 score.

6.4.4 Results and Discussion

Table 6.1 shows the average F1 scores with their standard deviations, the average training times in minutes, and the average number of non-zero clauses on CiteSeer. All results are averaged over the four folds. First, whether starting from LC_0 or ISM, OSL is able to find new useful clauses that improve the F1 scores. For LC_0, compared to the system that only does weight learning, the fast version of OSL, OSL-M2, increases the average F1 score by 9.4 points, from 82.62 to 92.05. The slow version of OSL, OSL-M1, further improves the average F1 score to 94.47. For ISM, even though it is a well-developed MLN, OSL is still able to enhance it. OSL-M1-ISM achieves the best average F1 score, 96.48, which is 2 points higher than the current best F1 score, achieved by a complex joint segmentation model that also uses information from matching multiple citations of the same paper (Poon & Domingos, 2007). On the other hand, the results of OSL-M2-Empty and OSL-M1-Empty show that OSL also performs very well when there is no input MLN. OSL-M1 even finds a structure that has higher predictive accuracy than that of ISM. All of the differences in F1 score between OSL and ADAGRAD_FB are statistically significant according to a paired t-test at significance level 0.05. Regarding the training time, OSL-M2 takes on average a few more minutes than systems that only do weight learning. However, OSL-M1 takes more time to train since including more new clauses results in longer time for constructing the ground network, running inference, and computing the number of true groundings. The last column of Table 6.1 shows the average number of non-zero clauses in the final MLNs learnt by different systems. These numbers reflect the size of MLNs generated by different systems during training.

Table 6.1: Experimental results on CiteSeer.

Systems            Average F1      Average training time (minutes)   Average number of non-zero clauses
ADAGRAD_FB-LC_0    82.62 ± 2.12    10.40                             2,896
OSL-M2-LC_0        92.05 ± 2.63    14.16                             2,150
OSL-M1-LC_0        94.47 ± 2.04    163.17                            9,395
ADAGRAD_FB-ISM     91.18 ± 3.82    11.20                             1,250
OSL-M2-ISM         95.51 ± 2.07    12.93                             1,548
OSL-M1-ISM         96.48 ± 1.72    148.98                            8,476
OSL-M2-Empty       88.94 ± 3.96    23.18                             650
OSL-M1-Empty       94.03 ± 2.62    257.26                            15,212

Table 6.2: Experimental results on Craigslist.

Systems            F1       Training time (minutes)   Number of non-zero clauses
ADAGRAD_FB-ISM     79.57    2.57                      2,447
OSL-M2-ISM         77.26    3.88                      2,817
OSL-M1-ISM         81.58    33.63                     9,575
OSL-M2-Empty       55.28    17.64                     1,311
OSL-M1-Empty       71.23    75.84                     17,430

Table 6.2 shows the experimental results on Craigslist. The segmentation task in Craigslist is much harder than the one in CiteSeer due to the huge variance in the context of different ads. As a result, most words only appear once or twice in the training set. Thus the most important rules are those that correlate words with fields and those capturing the regularity that consecutive words are usually in the same field, which are already in ISM. In addition, most rules only appear once in a document. This is why OSL-M2 is not able to find useful clauses, but OSL-M1 is able to find some useful clauses that improve the F1 score of ISM from 79.57 to 81.58. On the other hand, OSL also gives some promising results when starting from an empty MLN.

To answer question 3, we ran LSM on CiteSeer and Craigslist, but the MLNs returned by LSM result in huge ground networks that make weight learning infeasible, even using online weight learning. The problem is that these natural language problems have a huge vocabulary of words. Thus, failing to restrict clauses to specific words results in a blow-up in the size of the ground network. However, LSM is currently not able to learn clauses with constants. It is unclear whether it is feasible to make LSM efficiently learn clauses with constants, since those constants may need to be considered individually, which dramatically increases the search space. This problem also holds for other existing structure learners for MLNs (Kok & Domingos, 2005; Mihalkova & Mooney, 2007; Biba et al., 2008; Kok & Domingos, 2009). Our previous structure learner described in chapter 3 can learn clauses with constants, but it can only learn non-recursive clauses. Thus, it is not suitable for the field segmentation task.

Below  are  some  good  clauses  found  by  OSL-M2-ISM  on  CiteSeer: 

•  If  the  current  token  is  in  the  Title  field  and  it  is  followed  by  a  period 
then  it  is  likely  that  the  next  token  is  in  the  Venue  field. 

¬InField(Ftitle,p1,c) ∨ ¬FollowBy(p1,TPERIOD,c) ∨
¬Next(p1,p2) ∨ InField(Fvenue,p2,c)

• If the next token is ‘in’ and it is in the Venue field, then the current
token is likely in the Title field.

¬Next(p1,p2) ∨ ¬Token(Tin,p2,c) ∨ ¬InField(Fvenue,p2,c) ∨
InField(Ftitle,p1,c)

On  the  other  hand,  when  starting  from  an  empty  knowledge  base,  OSL-M2  is 
able  to  discover  the  regularity  that  consecutive  words  are  usually  in  the  same 
fields:

¬Next(p1,p2) ∨ ¬InField(Fauthor,p1,c) ∨ InField(Fauthor,p2,c)
¬Next(p1,p2) ∨ ¬InField(Ftitle,p1,c) ∨ InField(Ftitle,p2,c)
¬Next(p1,p2) ∨ ¬InField(Fvenue,p1,c) ∨ InField(Fvenue,p2,c)
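
To make concrete what such a clause asserts, the following minimal Python sketch counts how many groundings of ¬Next(p1,p2) ∨ ¬InField(f,p1,c) ∨ InField(f,p2,c) are true on one labeled citation. The list-of-labels representation and the function name are illustrative assumptions and are not part of OSL. Only groundings in which Next(p1,p2) holds are enumerated, since all others are trivially true.

```python
def count_same_field_groundings(labels, field):
    """Return (true, total) groundings of the clause
    not Next(p1,p2) or not InField(field,p1,c) or InField(field,p2,c)
    over the adjacent positions of one citation (assumed representation:
    one field name per token)."""
    true_count = 0
    total = 0
    for p1 in range(len(labels) - 1):
        p2 = p1 + 1  # Next(p1, p2) holds only for adjacent positions
        total += 1
        # The clause is false only when p1 is in the field and p2 is not.
        if not (labels[p1] == field and labels[p2] != field):
            true_count += 1
    return true_count, total


labels = ["author", "author", "title", "title", "title", "venue"]
for field in ("author", "title", "venue"):
    print(field, count_same_field_groundings(labels, field))
```

Each field boundary falsifies exactly one grounding, so the clause holds almost everywhere, which is the regularity a positive weight on it would capture.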

6.5  Related  Work 

Our  work  in  this  chapter  is  related  to  previous  work  on  online  feature 
selection  for  Markov  Random  Fields  (MRFs)  (Perkins  &  Theiler,  2003;  Zhu, 
Lao,  &  Xing,  2010).  However,  our  work  differs  in  two  aspects.  First,  this 
previous  work  assumes  all  the  training  examples  are  available  at  the  beginning 
and  only  the  features  are  arriving  online,  while  in  our  work  both  the  examples 
and  features  (clauses)  are  arriving  online.  Second,  in  previous  work,  the  new 
features  are  given,  while  in  our  work  the  new  features  are  induced  from  each 
example.  Thus,  our  work  is  also  related  to  previous  work  on  feature  induction 
for  MRFs  (Della  Pietra,  Della  Pietra,  &  Lafferty,  1997;  McCallum,  2003),  but 
those  are  batch  methods. 

The  idea  of  combining  relational  pathfinding  with  mode  declarations 
has  been  used  in  previous  work  (Ong,  de  Castro  Dutra,  Page,  &  Costa,  2005; 
Duboc,  Paes,  &  Zaverucha,  2008).  However,  how  they  are  used  is  different. 
In  (Ong  et  al.,  2005),  mode  declarations  were  used  to  transform  a  bottom 
clause  into  a  directed  hypergraph  where  relational  pathfinding  was  used  to 
find  paths.  Similarly,  in  (Duboc  et  al.,  2008),  mode  declarations  were  used 
to  validate  paths  obtained  from  bottom  clauses.  Here,  mode  declarations  are 
first used to reduce the search space to paths that contain ‘input’ and ‘output’
nodes. Then they are used to test whether a hyperedge can be added to an
existing path. Finally, they are used to create clauses with constants.

6.6  Chapter  Summary 

In  this  chapter,  we  present  OSL,  the  first  online  structure  learner  for 
MLNs.  In  each  step,  OSL  uses  mode-guided  relational  pathfinding  to  find  new 
clauses  that  fix  the  model’s  wrong  predictions.  Experimental  results  in  field 
segmentation  on  two  real-world  datasets  show  that  OSL  is  able  to  find  new 
useful clauses that improve the predictive accuracies of well-developed MLNs.

Chapter 7


Automatically  Selecting  Hard  Constraints  to 
Enforce  when  Training  Structured  Prediction 

7.1  Introduction 

Many real-world applications of machine learning involve a mix of soft
probabilistic constraints and hard logical constraints. For example, when
extracting relations from natural language sentences, the outputs must satisfy
hard constraints like “the first argument of a live_in relation must be a person
entity, and the second argument must be a location entity,” as well as many soft
constraints such as “the word ‘residence’ frequently appears between the two
arguments of a live_in relation.” Or when segmenting bibliographic citations,
a prediction must conform to the hard constraint “a Venue token cannot
appear before a Title token,” as well as many soft constraints such as “the word
‘International’ is usually a Venue token.”

In terms of graphical models, hard constraints add new interactions
between variables, which increases the computational complexity of a problem.
On the other hand, from the point of view of probabilistic models, hard
constraints introduce deterministic factors into the model (i.e., zero out some
potential function values), which causes considerable trouble for existing inference and
learning methods (Poon & Domingos, 2006). Thus, finding the most effective
and efficient way to perform inference and learning for problems with a mix of
hard and soft constraints is an ongoing research problem (Chang, Ratinov,
Rizzolo, & Roth, 2008). Previous work has explored two different approaches to
the learning problem. The first approach, called learning plus inference (L+I)
(Punyakanok, Roth, Yih, & Zimak, 2005), is to completely ignore hard
constraints during training and only enforce them at testing time. At first,
this approach does not seem theoretically appealing since hard constraints are
only used during testing and have no effect on the learning process. However,
the L+I approach allows efficient modular training of individual components
that are only integrated as needed for testing and has achieved significant
successes in many real-world applications (Roth & Yih, 2005, 2007). For
example, Punyakanok et al. (2004) and Koomen et al. (2005) trained classifiers
to independently assign a semantic role to each noun phrase in a natural
language sentence, and then Integer Linear Programming was used to determine
the most likely set of global assignments that satisfies a set of hard
linguistic constraints. The second approach, called inference based training (IBT)
(Punyakanok et al., 2005), includes all constraints both in training and
testing. This approach is theoretically more desirable since it takes into account
all constraints at training time and thus can ideally learn a more accurate model.
Nevertheless, so far there has been relatively little empirical success on real-world
problems with this approach since enforcing deterministic constraints during
learning typically makes the training problem significantly more complex and
computationally intractable.

In this chapter, we propose a new approach to incorporating declarative
hard constraints when learning probabilistic models for structured
prediction. The key idea is to only include “inexpensive” constraints during training
and only enforce the remaining “expensive” constraints during testing. Our
new approach, which we will call Selectively Constrained Training (SCT), lies
somewhere between the two extreme approaches reviewed above, attempting
to achieve the improved accuracy of the IBT approach while retaining the
training efficiency of the L+I approach.

The remainder of the chapter is organized as follows. Section 7.2
describes our heuristic for selecting which hard constraints are “inexpensive” and
should be included in training. Section 7.3 presents the experimental
evaluation of the proposed approach. Section 7.4 discusses related work and section
7.5 summarizes the chapter.

7.2  Heuristic  for  selecting  hard  constraints  to  use  in 
training 

As previously mentioned, the main problem with including hard
constraints during training is that, in practice, it greatly increases the
computational complexity of the learning problem. Examining the problem further,
we found that the main effect of enforcing hard constraints during training is
on the complexity of the inference problem. Introducing hard constraints in
training usually results in a much more complex inference problem which
cannot be solved efficiently in most cases. Therefore, it significantly impacts the
training process since we need to solve these complex inference problems many
times during training. So our idea is to only include hard constraints in
training when they do not significantly increase the complexity of the underlying
inference problem. We call such hard constraints “inexpensive”. A standard
metric for measuring the complexity of an inference problem is the tree-width
of the graphical structure (Koller & Friedman, 2009). However, computing
the tree-width of a graph is an NP-hard problem in general (Arnborg, Corneil,
& Proskurowski, 1987). There are methods for approximating the tree-width
(Koller & Friedman, 2009), but doing so is still computationally expensive for SRL
formalisms since we need to construct the ground networks and compute the
approximate tree-width for each possible subset of the hard constraints. We
now describe a simple and efficient heuristic for detecting “inexpensive” hard
constraints.

From the point of view of the resulting graphical model, each hard
constraint defines a graphical structure among the output variables. So we
define an “inexpensive” hard constraint in terms of how its addition to the
graphical model affects the efficiency of inference. Since all graphical models
can be converted into factor graphs, we analyze the complexity of a constraint
based on its factor graph representation (Kschischang, Frey, & Loeliger, 2001).
From the perspective of factor graphs, a hard constraint introduces new factors
into the original problem. Intuitively, the denser the factor graph is, the
harder the inference problem is since a dense factor graph tends to have high
tree-width. For a given problem, the number of random variables, i.e., the
number of nodes in its factor graph, is a fixed number, so the denseness of
the factor graph is determined by the number of factors and the number of
edges. Therefore, we measure the complexity of a hard constraint based on
the number of factors added by the constraint and the degree (the number of
edges connecting to nodes or the number of involved variables) of the created
factors. A well-known result is that if the graph is a linear chain then inference
is efficient (Sutton & McCallum, 2007). Looking at the factor graph of a linear
chain, we see that it has n nodes (one for each output variable), n − 1 factors
(one for each pair of adjacent nodes), and each factor has a degree of two. So
for linear chains, the number of factors times the degree, 2(n − 1), is linear in
the number of nodes, n. Based on this observation, we propose the
following heuristic for detecting “inexpensive” hard constraints:

Definition.  An  “inexpensive”  hard  constraint  is  one  that  creates  a 
graphical  structure  in  which  the  number  of  factors  times  the  degree  of  each 
factor  is  linear  in  its  number  of  nodes. 

For  an  MLN,  this  heuristic  is  easily  implemented.  The  number  of  nodes 
created  by  a  clause  is  the  total  number  of  unique  ground  literals  of  its  query 
predicates.  The  number  of  factors  that  a  clause  creates  is  its  number  of  unique 
ground clauses. The degree of each created factor is the number of
appearances of query predicates in the clause. Therefore, based on examining these
quantities, it can be automatically decided whether or not a hard clause is
“inexpensive.” Below are examples of “inexpensive” and “expensive” hard
constraints in MLNs for the task of citation segmentation:


• An “inexpensive” constraint: a Venue token cannot appear right
after an Author token

Next(p1,p2) ∧ InField(Fauthor,p1,c) ⇒ ¬InField(Fvenue,p2,c).

This constraint satisfies the above criterion since the number of generated
ground clauses is n − 1, there are two appearances of the query predicate
InField in each ground clause, and the total number of unique ground
literals is 2n, where n is the number of possible positions (i.e., tokens) in
the citation to be segmented.

• An “expensive” constraint: a Venue token cannot appear before an
Author token

LessThan(p1,p2) ∧ InField(Fauthor,p2,c) ⇒
¬InField(Fvenue,p1,c).

This constraint does not satisfy the above criterion since the relationship
between the number of generated ground clauses and the total number
of unique ground literals is quadratic (n(n − 1)/2 ground clauses and
2n ground literals).
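
The counting argument behind these two examples can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis implementation: a clause is described only by its query literals and the position relation that restricts its groundings, nodes are counted as the ground query literals that actually occur in some ground clause, and the doubling test with a `tolerance` parameter is an assumed stand-in for "linear in the number of nodes". All names are hypothetical.

```python
from itertools import product


def ground_stats(query_literals, relation, n):
    """query_literals: list of (field, index of its position variable).
    relation: predicate over positions, e.g. Next or LessThan.
    Returns (nodes, factors, degree) for a citation with n tokens."""
    num_vars = 1 + max(var for _, var in query_literals)
    nodes, factors = set(), 0
    for positions in product(range(n), repeat=num_vars):
        if relation(*positions):
            factors += 1  # one ground clause per binding that satisfies the relation
            for field, var in query_literals:
                nodes.add((field, positions[var]))
    return len(nodes), factors, len(query_literals)


def is_inexpensive(query_literals, relation, n=20, tolerance=1.5):
    """Assumed test: factors * degree per node should stay roughly
    constant when the number of tokens doubles."""
    def cost_per_node(size):
        nodes, factors, degree = ground_stats(query_literals, relation, size)
        return factors * degree / nodes
    return cost_per_node(2 * n) <= tolerance * cost_per_node(n)


def next_rel(p1, p2):
    return p2 == p1 + 1


def less_than(p1, p2):
    return p1 < p2


# Next(p1,p2) ∧ InField(Fauthor,p1,c) ⇒ ¬InField(Fvenue,p2,c)
adjacent_clause = [("author", 0), ("venue", 1)]
# LessThan(p1,p2) ∧ InField(Fauthor,p2,c) ⇒ ¬InField(Fvenue,p1,c)
ordering_clause = [("venue", 0), ("author", 1)]

print(is_inexpensive(adjacent_clause, next_rel))
print(is_inexpensive(ordering_clause, less_than))
```

Enumerating bindings is only practical for the small n used here; a real system would read these counts off the grounded network instead.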

In  the  SCT  approach,  only  the  inexpensive  constraints  are  used  when  training 
the weights of the soft clauses in the model. This ensures that inference, and
therefore training, is reasonably efficient. During testing, all hard clauses and
soft clauses with learned weights are used when making predictions. This helps
ensure accurate predictions for test cases.


7.3  Experimental  Evaluation 

7.3.1  Data 

In order to empirically evaluate SCT and compare it to L+I and IBT,
we ran experiments on two standard bibliographic citation datasets: CiteSeer,
described in section 4.4.1, and Cora (Bilenko & Mooney, 2003).

The task is to segment each citation into three fields: Author, Title and
Venue. There are 1,563 and 1,295 citations in CiteSeer and Cora respectively.
The CiteSeer dataset has four independent subsets consisting of citations in
four different research areas. There are 3 disjoint subsets of citations in Cora.

7.3.2  Hard  Constraints 

We used the MLN for the isolated segmentation model described in section 6.4.2
as the base MLN. It has one hard clause imposing the constraint that a token
can only be part of at most one field (mutual exclusivity):

InField(f1,p,c) ∧ (f1 != f2) ⇒ ¬InField(f2,p,c).

When  analyzing  the  prediction  errors  of  the  isolated  segmentation  model 
including this constraint, we noticed these types of recurring errors:

• Violations of the purity of each field. For example, we found Title or
Venue tokens between two Author tokens in many examples.

• Violations of the typical order of field appearances such as the Venue
field appearing between the Author and Title fields.

To  correct  these  errors,  we  introduced  the  following  new  hard  constraints: 

• Continuity constraint (C1): any tokens between two tokens of the same
field must also belong to that field.

LessThan(p1,p2) ∧ LessThan(p2,p3) ∧ InField(+f,p1,c) ∧
InField(+f,p3,c) ⇒ InField(+f,p2,c).

• Constraints on the order of field appearance.

- C2: a Title token cannot appear before an Author token

LessThan(p1,p2) ∧ InField(Fauthor,p2,c) ⇒ ¬InField(Ftitle,p1,c).

- C3: a Venue token cannot appear before an Author token

LessThan(p1,p2) ∧ InField(Fauthor,p2,c) ⇒ ¬InField(Fvenue,p1,c).

- C4: a Venue token cannot appear before a Title token

LessThan(p1,p2) ∧ InField(Ftitle,p2,c) ⇒ ¬InField(Fvenue,p1,c).

- C5: a Venue token cannot appear right after an Author token

Next(p1,p2) ∧ InField(Fauthor,p1,c) ⇒ ¬InField(Fvenue,p2,c).

According to the heuristic described in the previous section, only the mutual
exclusivity constraint and constraint C5 are “inexpensive” constraints; the rest
are “expensive” ones.
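
As an illustration of what constraints C1 to C5 rule out, here is a minimal Python sketch that reports which of them a given labeling violates. It assumes a citation is represented as a list with one field name per token, so mutual exclusivity holds by construction; the function name and the representation are hypothetical and chosen only for this example.

```python
def violated_constraints(labels):
    """Return the sorted names of the hard constraints C1-C5 that are
    violated by a labeling given as one field name per token."""
    violated = set()
    n = len(labels)
    for p1 in range(n):
        for p2 in range(p1 + 1, n):  # LessThan(p1, p2)
            # C1: tokens between two tokens of the same field share that field.
            if labels[p1] == labels[p2] and any(
                labels[p] != labels[p1] for p in range(p1 + 1, p2)
            ):
                violated.add("C1")
            # C2: a Title token cannot appear before an Author token.
            if labels[p1] == "title" and labels[p2] == "author":
                violated.add("C2")
            # C3: a Venue token cannot appear before an Author token.
            if labels[p1] == "venue" and labels[p2] == "author":
                violated.add("C3")
            # C4: a Venue token cannot appear before a Title token.
            if labels[p1] == "venue" and labels[p2] == "title":
                violated.add("C4")
    # C5: a Venue token cannot appear right after an Author token.
    for p1 in range(n - 1):
        if labels[p1] == "author" and labels[p1 + 1] == "venue":
            violated.add("C5")
    return sorted(violated)


print(violated_constraints(["author", "title", "author", "title", "venue"]))
print(violated_constraints(["author", "venue", "title"]))
print(violated_constraints(["author", "author", "title", "venue"]))
```

Note that C1 to C4 each inspect all ordered pairs (or triples) of positions, while C5 inspects only adjacent pairs, which mirrors the distinction between “expensive” and “inexpensive” constraints.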

7.3.3  Methodology 

We  evaluated  the  following  systems: 

No  constraints  :  Train  the  base  MLN,  which  is  the  isolated  segmentation 
model  without  the  mutual  exclusivity  constraint,  and  use  the  learned 
MLN  for  testing. 

L+I  :  Like  the  previous  system,  but  include  all  of  the  hard  constraints  above 
during  testing. 

IBT  :  Train  a  model  that  includes  clauses  from  the  base  MLN  as  well  as  all 
of  the  hard  constraints  above,  and  then  use  the  learned  MLN  with  all 
the  constraints  for  testing. 

IBT-Approx : Like IBT but use approximate inference instead of exact
inference in training.

SCT : Use the heuristic described in the previous section to automatically select
the “inexpensive” hard constraints from the hard constraints above and
only include them when training, and then include all constraints during
testing.

SCT- : Like SCT but does not include “expensive” constraints in testing.
In other words, this is IBT without “expensive” constraints.
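
The six systems differ only in which hard constraints are active at training and at testing time, and in the kind of inference used during training. The following Python sketch summarizes that reading of the descriptions above as a small lookup table. The label `mutex` for the mutual exclusivity clause and the dictionary layout are assumptions made for illustration; the entry "exact" for systems other than IBT-Approx is inferred from the description of IBT-Approx rather than stated explicitly.

```python
INEXPENSIVE = frozenset({"mutex", "C5"})        # selected by the heuristic
EXPENSIVE = frozenset({"C1", "C2", "C3", "C4"})
ALL = INEXPENSIVE | EXPENSIVE
NONE = frozenset()

# Hard constraints enforced by each system, as read from the list above.
SYSTEMS = {
    "No constraints": {"train": NONE, "test": NONE, "train_inference": "exact"},
    "L+I": {"train": NONE, "test": ALL, "train_inference": "exact"},
    "IBT": {"train": ALL, "test": ALL, "train_inference": "exact"},
    "IBT-Approx": {"train": ALL, "test": ALL, "train_inference": "approximate"},
    "SCT": {"train": INEXPENSIVE, "test": ALL, "train_inference": "exact"},
    "SCT-": {"train": INEXPENSIVE, "test": INEXPENSIVE, "train_inference": "exact"},
}


def added_at_test_time(system):
    """Constraints that a system enforces only when making predictions."""
    config = SYSTEMS[system]
    return sorted(config["test"] - config["train"])


for name in SYSTEMS:
    print(name, added_at_test_time(name))
```

Viewed this way, SCT sits between L+I (everything deferred to testing) and IBT (nothing deferred).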

To train all of the systems, we used the CDA-ML algorithm described
in section 5.2. We used Hamming loss as the label loss function and set the
value of σ to 0.0001 for CiteSeer and 0.002 for Cora.¹ We ran one pass over
the training set and used the average weight vector for making predictions on
the test set. For inference, we used the procedure described in section 4.3.1 to
translate the MPE inference problem into an ILP. An ILP solver was then used
for exact inference, and the ILP was relaxed to a Linear Program (LP) and an
LP solver used for approximate inference.² Only exact inference guarantees
that inference results satisfy all hard constraints. For all systems, we ran 4-fold
cross-validation on CiteSeer and 3-fold cross-validation on Cora.
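
To illustrate why exact inference guarantees that hard constraints are satisfied, the sketch below finds the most probable labeling of a tiny example by exhaustive enumeration, with and without an ordering constraint. This is a brute-force stand-in for the ILP solver, usable only for a handful of tokens, and it is not the translation procedure of section 4.3.1. The per-token scores, the function names, and the reduction of constraints C2 to C4 to a single "fields appear in the order Author, Title, Venue" check are assumptions made for the example.

```python
from itertools import product

FIELDS = ("author", "title", "venue")


def satisfies_order(labels):
    """Hard constraints C2-C4 combined: no Title before Author and no
    Venue before Author or Title, i.e. field ranks never decrease."""
    ranks = [FIELDS.index(f) for f in labels]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


def mpe(token_scores, hard_constraint=None):
    """Exhaustively search for the highest-scoring labeling.
    token_scores[i][f] is an assumed local score of field f at position i."""
    best, best_score = None, float("-inf")
    for labels in product(FIELDS, repeat=len(token_scores)):
        if hard_constraint is not None and not hard_constraint(labels):
            continue  # labelings that violate a hard constraint are excluded
        score = sum(s[f] for s, f in zip(token_scores, labels))
        if score > best_score:
            best, best_score = labels, score
    return best


token_scores = [
    {"author": 2.0, "title": 0.5, "venue": 0.1},
    {"author": 0.2, "title": 1.0, "venue": 1.5},  # locally prefers Venue
    {"author": 0.1, "title": 2.0, "venue": 0.3},
]
print(mpe(token_scores))
print(mpe(token_scores, satisfies_order))
```

Without the constraint, the middle token is labeled Venue even though a Title token follows it; with the constraint enforced, the search is restricted to feasible labelings and the error is corrected.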

Like previous work, we used the F1 at the token level, F1-token, to
measure the segmentation accuracy of all systems. However, the F1-token
only captures the local performance, which is not the best metric for
measuring the effect of hard constraints, which enforce global properties of the
complete segmentation. To better measure the effect of hard constraints, we
also computed the F1 at the field level, F1-field (all tokens in a field must be
correctly assigned the right labels in order to count as a correct field), and
citation accuracy, the proportion of citations in which all tokens are assigned the
correct labels, i.e., the percentage of citations segmented completely correctly.

¹ The value of σ was set based on predictive performance on the training set.

² We used lp_solve (http://lpsolve.sourceforge.net) for ILP and Mosek
(http://www.mosek.com) for LP.

Table 7.1: Performance of different systems on CiteSeer. Results are averaged
over 4 folds. Results of the proposed approach are shown in bold.

System          F1-token       F1-field        Citation accuracy  Average training time (minutes)
No constraints  93.53 ± 2.63   84.41 ± 8.63    37.25 ± 12.50      12.1
L+I             95.10 ± 2.51   85.85 ± 9.61    63.30 ± 22.59      12.1
IBT             95.53 ± 2.04   88.00 ± 6.75    66.83 ± 17.95      322.8
IBT-Approx      90.37 ± 3.73   69.48 ± 15.55   20.45 ± 15.04      64.4
SCT             95.64 ± 2.05   87.97 ± 6.85    66.72 ± 18.72      37.4
SCT-            94.61 ± 1.95   84.70 ± 7.91    43.24 ± 10.37      37.4

Table 7.2: Performance of different systems on Cora. Results are averaged
over 3 folds. Results of the proposed approach are shown in bold.

System          F1-token       F1-field        Citation accuracy  Average training time (minutes)
No constraints  96.46 ± 0.45   92.58 ± 3.08    55.91 ± 3.96       2.6
L+I             98.88 ± 0.21   93.74 ± 0.86    81.30 ± 1.86       2.6
IBT             98.92 ± 0.40   95.63 ± 1.74    84.45 ± 7.50       266.3
IBT-Approx      87.06 ± 1.98   62.12 ± 4.69    13.61 ± 5.15       87.2
SCT             98.98 ± 0.49   96.04 ± 1.83    85.39 ± 7.08       22
SCT-            98.10 ± 0.90   94.73 ± 2.08    71.40 ± 9.51       22
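
The three accuracy measures reported in Tables 7.1 and 7.2 can be computed along the following lines. This Python sketch rests on two assumptions that the text does not spell out: token-level F1 is micro-averaged over all tokens (which, with exactly one label per token, coincides with token accuracy), and a field is a maximal run of identically labeled tokens that counts as correct only when its label and both boundaries match the gold segmentation. The function names are hypothetical.

```python
def segments(labels):
    """Maximal runs of equal labels as (field, start, end) triples."""
    runs, start = set(), 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.add((labels[start], start, i))
            start = i
    return runs


def f1(num_correct, num_predicted, num_gold):
    """Harmonic mean of precision and recall from raw counts."""
    if num_correct == 0:
        return 0.0
    precision = num_correct / num_predicted
    recall = num_correct / num_gold
    return 2 * precision * recall / (precision + recall)


def evaluate(gold, predicted):
    """gold, predicted: lists of citations, each a list of field labels."""
    token_correct = token_total = 0
    field_correct = field_predicted = field_gold = 0
    citations_correct = 0
    for g, p in zip(gold, predicted):
        token_correct += sum(a == b for a, b in zip(g, p))
        token_total += len(g)
        gold_fields, predicted_fields = segments(g), segments(p)
        field_correct += len(gold_fields & predicted_fields)
        field_predicted += len(predicted_fields)
        field_gold += len(gold_fields)
        citations_correct += int(g == p)
    return {
        "f1_token": f1(token_correct, token_total, token_total),
        "f1_field": f1(field_correct, field_predicted, field_gold),
        "citation_accuracy": citations_correct / len(gold),
    }


gold = [["author", "author", "title", "title", "venue"],
        ["author", "title", "venue"]]
predicted = [["author", "title", "title", "title", "venue"],
             ["author", "title", "venue"]]
print(evaluate(gold, predicted))
```

In this toy example a single mislabeled token leaves the token-level score high but breaks two fields and one whole citation, which is why the field-level and citation-level measures are more sensitive to the effect of hard constraints.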

7.3.4  Results  and  Discussion 

Table 7.1 and Table 7.2 show the performance of different systems
on CiteSeer and Cora respectively. The first three columns show how adding
hard constraints improves segmentation accuracy. All of the systems with
constraints except IBT-Approx are significantly better than the one without them,
especially in terms of citation accuracy. For example, the citation accuracies
of IBT and SCT are about 30 percentage points higher than that of “No
constraints” on average. The huge jump in citation accuracy arises because,
without constraints, many examples have only a few mislabeled
tokens. Among the systems using constraints, the ones that are trained using
constraints except IBT-Approx have higher segmentation accuracy than the
one that only enforces constraints during testing (L+I). On the other hand,
the significant difference between the citation accuracy of SCT- and that of
SCT shows the usefulness of the expensive hard constraints (C1 to C4).

Therefore, including hard constraints during learning improves the
accuracy of the learned model with exact inference. However, the last column of
Tables 7.1 and 7.2 shows that naively including all hard constraints in training
results in a huge increase in training time. Training for the IBT approach
takes 26.7 times longer than the L+I approach on CiteSeer, and roughly 100
times longer on Cora. IBT-Approx results show that using approximate
inference in training reduces training time but also significantly decreases
accuracy. However, using the heuristic described in section 7.2, SCT takes only
about 3 times longer to train than L+I and is 8.6 times faster than IBT on CiteSeer;
on Cora, it takes 8.5 times longer than L+I and is 12 times faster than IBT, while
still matching IBT’s accuracy.
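
The training-time ratios quoted in this paragraph follow directly from the last columns of Tables 7.1 and 7.2. As a quick check, the short Python sketch below recomputes them; the dictionary layout and names are purely illustrative.

```python
# Average training time in minutes, copied from Tables 7.1 and 7.2.
train_minutes = {
    "CiteSeer": {"L+I": 12.1, "IBT": 322.8, "SCT": 37.4},
    "Cora": {"L+I": 2.6, "IBT": 266.3, "SCT": 22.0},
}


def ratios(times):
    """Relative training cost of the constrained systems."""
    return {
        "IBT / L+I": times["IBT"] / times["L+I"],
        "SCT / L+I": times["SCT"] / times["L+I"],
        "IBT / SCT": times["IBT"] / times["SCT"],
    }


for dataset, times in train_minutes.items():
    print(dataset, {name: round(value, 1) for name, value in ratios(times).items()})
```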

IBT-Approx has the worst accuracy, which may be due to the
combination of online learning and approximate inference, since the same approximate
inference method has shown good performance in chapter 4. The performance
of IBT-Approx may be improved by using the latest methods for learning with
approximate inference (Meshi, Sontag, Jaakkola, & Globerson, 2010; Martins,
Smith, Xing, Aguiar, & Figueiredo, 2010; Koo, Rush, Collins, Jaakkola, &
Sontag, 2010).

7.4  Related  Work 

Learning and inference with constraints has been studied in several
previous papers (Punyakanok et al., 2005; Tromble & Eisner, 2006; Chang et al.,
2008). The first comprehensive study of the issue of learning and inference over
constrained output was conducted by Punyakanok et al. (2005). The L+I and
IBT approaches were defined in that work, and experiments on both synthetic
and real-world data were conducted to compare their predictive performance.
The authors also presented some theoretical results on the predictive
performance of the L+I and IBT approaches. However, they did not look at the case
of only enforcing some constraints during training. In followup work, Chang et
al. (2008) proposed Constrained Conditional Models (CCMs), an extension of
linear models for combining probabilistic models with declarative constraints.
A CCM has two weight vectors: one for features and one for constraints.
Unlike MLNs, CCMs separate constraints from features, thus their weights are
learned in different manners. When all the constraints are hard constraints,
then a CCM can be represented by an MLN where the soft clauses in the MLN
are the features and the hard clauses are the constraints in the corresponding
CCM. Like Punyakanok et al. (2005), the authors only looked at two ways to
train a CCM: the L+I and IBT approaches.

By  only  enforcing  “inexpensive”  constraints  in  training,  our  work  shares 
the  goal  of  piecewise  training  (Sutton  &  McCallum,  2009),  which  is  “to  perform 
less  inference  at  training  time  than  at  test  time.”  The  key  idea  of  piecewise 
training  is  to  break  the  original  complex  problem  into  smaller  pieces,  train 
these  pieces  independently,  and  then  combine  them  at  testing  for  prediction. 
In  this  sense,  piecewise  training  is  similar  to  an  extreme  version  of  the  L+I 
approach  (Chang  et  al.,  2008)  where  no  interaction  between  output  variables 
is  learned  during  training  and  constraints  are  only  used  in  testing  to  capture 
those  relationships.  So  far,  piecewise  training  has  only  been  applied  in  training 
structured  prediction  models  without  hard  constraints.  It  would  be  interesting 
to  see  how  well  it  performs  when  there  are  hard  constraints  in  the  models. 

Another line of research on speeding up the learning process for complex
problems is learning with approximate inference (Martins, Smith, & Xing,
2009; Meshi et al., 2010; Martins et al., 2010; Koo et al., 2010). It would
be interesting to see how well these methods help improve the accuracy of
IBT-Approx.

7.5  Chapter  Summary 

We  have  presented  a  new  approach  to  incorporating  declarative  hard 
constraints  into  learning  models  for  structured  prediction.  The  idea  is  to  only 
enforce  “inexpensive”  hard  constraints  during  training  that  do  not  inordinately 
increase the computational complexity of inference. We also proposed a simple
heuristic for detecting “inexpensive” hard constraints in Statistical Relational
Learning  frameworks  like  Markov  Logic  that  represent  models  as  templates  for 
constructing  graphical  models.  The  proposed  approach  was  applied  to  MLNs 
on  the  task  of  bibliographic  citation  segmentation.  Experimental  results  show 
that  the  new  approach  achieves  the  best  predictive  accuracy  while  still  allowing 
for  efficient  training. 




Chapter  8 


Future  Work 


This  chapter  discusses  some  ways  in  which  the  contributions  of  this 
thesis  can  be  extended. 

8.1  Online  Max-Margin  Weight  Learning 

As mentioned earlier, the CDA algorithm developed in chapter 5 can
be applied to other structured prediction models. So it would be interesting
to apply CDA to models such as M3Ns (Taskar et al., 2004). On the other
hand, currently CDA assumes that the data are fully observable. However,
there are problems in which some information is not observable. For example,
in plan recognition, we usually only observe the actions and the top-level
plans, not the intermediate plans (Blaylock & Allen, 2005). Existing work has
developed a max-margin approach for learning with partially observable data
(Yu & Joachims, 2009), but the method is for a batch setting. So it would be
interesting to extend CDA to the case of learning with partially observable data.
Another avenue for future work is to derive Coordinate-Dual-Ascent algorithms
for l1-regularized max-margin structured prediction.

8.2 Online Structure Learning

In chapter 6, we presented OSL, the first online structure learner for
MLNs. Since this is the initial step, there is still a lot of room for improvement.
First, the mode declarations provide a simple way to restrict the search space
for paths but there are some types of constraints that cannot be expressed by
mode declarations such as the constraint that the predicate P1 and predicate
P2 should not appear in the same path. So it may be useful to use some other
forms of language biases that are more expressive. In addition, OSL currently
does not use clauses in the existing MLN to restrict the search space. So it
would be useful to exploit this information. Second, OSL, especially OSL-M1,
currently adds a lot of new clauses at each step, which significantly increases
the computational cost. So it would be useful to develop a better criterion for
selecting fewer but useful clauses at each step. On the other hand, besides
l1-regularization, there are also other methods for inducing sparse models such as
greedy methods (Zhang, 2009). Thus, it would be interesting to explore those
methods in order to reduce the number of clauses learnt by OSL. Finally,
relational pathfinding is just one way to learn clauses from data, and there are
some types of clauses that cannot be learnt by relational pathfinding such as
clauses containing non-relational literals. Hence, it may be useful to combine
relational pathfinding with other search methods.

8.3 Other Issues in Online Learning

In  this  thesis,  we  only  considered  one  extreme  setting  in  online  learning 
where  the  learner  only  uses  the  current  example  for  updating  the  model  at 
each  step  and  only  runs  one  pass  over  examples.  So  it  would  be  interesting 
to  explore  other  settings  such  as  running  multiple  passes  (or  epochs)  over 
examples  or  maintaining  a  window  of  examples.  It  would  also  be  useful  to 
study the sensitivity of the online algorithms developed in chapters 5 and 6 to
the order of examples.
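
The settings mentioned above differ only in which examples the learner sees at each update step. A minimal sketch of this idea follows. The generator `example_stream` and its parameters are hypothetical and are not part of the thesis code.

```python
import random
from collections import deque
from typing import Deque, Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def example_stream(
    examples: Sequence[T],
    epochs: int = 1,
    window: int = 1,
    shuffle_seed: Optional[int] = None,
) -> Iterator[List[T]]:
    """Yield, for each update step, the examples visible to the learner.

    epochs=1 and window=1 give the single-pass, current-example-only setting.
    """
    order = list(range(len(examples)))
    rng = random.Random(shuffle_seed)
    recent: Deque[T] = deque(maxlen=window)  # oldest examples fall out automatically
    for _ in range(epochs):
        if shuffle_seed is not None:
            rng.shuffle(order)  # vary the order to probe order sensitivity
        for i in order:
            recent.append(examples[i])
            yield list(recent)
```

Running the same learner over streams produced with different values of `shuffle_seed` would be one simple way to measure sensitivity to the order of examples.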

8.4 Discriminative Learning with Large Mega-Examples

In  this  thesis,  we  addressed  one  dimension  of  the  issue  of  scalability  in 
discriminative  learning  for  MLNs  when  the  number  of  examples  is  increasing. 
However,  the  issue  of  scalability  also  arises  when  the  size  of  an  example  is 
getting  bigger.  For  instance,  in  social  network  analysis  (Backstrom  &  Leskovec, 
2011),  each  example  is  a  huge  network.  In  order  to  use  the  max-margin  weight 
learning  methods  developed  in  chapters  4  and  5  on  these  large  mega-examples, 
one  needs  to  develop  efficient  MPE  inference  methods  for  large  graphs.  For 
instance,  one  can  use  the  approach  described  by  Singla  and  Domingos  (2008) 
to  lift  the  max-product  algorithm  (Pearl,  1988),  a  widely  used  approximation 
algorithm for MPE inference in Markov networks. For discriminative structure
learning on large mega-examples, one can adapt the motif identification step
in LSM to search only for motifs that contain the query predicates, and then use
the mode-guided relational pathfinding developed in chapter 6 to search for
paths in each motif. In addition, another important line of research is to
develop  online  learning  methods  for  the  case  where  the  example  is  a  single 
huge  network  that  is  changing  over  time. 

8.5  Improving  Scalability  through  Parallelism 

In this thesis, we improved the scalability of discriminative learning
methods for MLNs through online learning. However, another way to speed
up existing discriminative learning methods for MLNs is to parallelize
them. For instance, one may use GraphLab (Low, Gonzalez, Kyrola, Bickson,
Guestrin, & Hellerstein, 2010), a recently proposed parallel framework for
machine learning algorithms, to speed up existing batch learning methods for
MLNs. In addition, the online algorithms presented in chapters 5 and 6 can
also be sped up by computing the subgradients and performing the weight
updates in parallel.
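
A minimal sketch of this idea, under stated assumptions: each clause is represented by a hypothetical function that counts its true groundings in a world, the subgradient component for a clause is the difference of its counts in the predicted and the true world, and the update is a plain subgradient step rather than the CDA update. Threads are used only to show the decomposition. In CPython, CPU-bound counting would need processes or native code to gain real speed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# A "world" is any object describing a truth assignment. Each clause is
# represented by a hypothetical function counting its true groundings.
CountFn = Callable[[object], float]


def subgradient(count_fns: Sequence[CountFn], predicted, truth, workers: int = 4) -> List[float]:
    """Per-clause difference n_i(predicted) - n_i(truth), computed concurrently."""
    def component(count: CountFn) -> float:
        return count(predicted) - count(truth)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of count_fns in its results.
        return list(pool.map(component, count_fns))


def update(weights: Sequence[float], grad: Sequence[float], step: float) -> List[float]:
    """Plain subgradient step; each coordinate is independent of the others."""
    return [w - step * g for w, g in zip(weights, grad)]
```

Because every coordinate of the subgradient and of the update depends on a single clause, the work can be partitioned by clause without any synchronization beyond collecting the results.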

8.6  Learning  with  Hard  Constraints 

In the previous chapter, we experimentally evaluated the SCT approach
with only one learning algorithm. So it would be useful to test the SCT
approach with other learning algorithms. Additionally, given the encouraging
results on citation segmentation, an obvious area for future research is
evaluating the selective enforcement of constraints when training models for other
real-world applications that involve many hard constraints, such as entity and
relation extraction. Furthermore, SCT could be applied to select inexpensive
constraints in other SRL formalisms that use expressions in logical,
relational-database, or object-oriented languages as templates for constructing
graphical models, such as BLPs (Kersting & De Raedt, 2001), PRMs (Getoor,
Friedman, Koller, & Pfeffer, 2001), RMNs (Taskar et al., 2002), and FACTORIE
(McCallum, Schultz, & Singh, 2009). Given the success with MLNs, it
would be interesting to see how well SCT performs on those formalisms.

Currently, SCT only addresses the computational aspect of hard
constraints. However, another important aspect of hard constraints is whether
they  are  helpful  or  not.  So  it  would  be  useful  to  extend  SCT  to  take  into 
account  the  usefulness  of  hard  constraints. 

8.7  Other  Applications 

In previous chapters, we applied our new algorithms to several
real-world structured prediction problems that involve data with thousands of
examples, such as natural language field segmentation, semantic role labeling,
and web search. There are many more real-world problems that have similar
characteristics. For example, many problems in computer vision involve data
with  thousands  of  images  where  each  image  is  an  example.  The  task  may  be 
to  segment  an  image  into  different  regions  or  to  recognize  all  the  objects  and 
their  interactions  in  a  given  image,  etc.  Thus,  it  would  be  interesting  to  apply 
the  methods  developed  in  this  thesis  to  those  problems. 

In addition, the online learning algorithms developed in chapters 5 and 6
are well suited to many problems in social media, where the data arrive as a
stream. So it would be interesting to apply those methods to problems with
streaming data.

Chapter 9


Conclusion 


The  research  presented  in  this  thesis  addresses  two  important  issues  in 
discriminative  learning  for  MLNs:  accuracy  and  scalability. 

We first presented a new method that discriminatively learns both the
structure and parameters for a special class of MLNs in which all the clauses
are non-recursive, which allows efficient exact inference. The proposed
approach is a two-step process. The first step uses Aleph to generate a large
set of potential clauses. The second step learns the weights for these clauses,
using l1-regularization to eliminate useless clauses by giving them zero
weight. The new method outperforms existing MLN and ILP methods
and achieves state-of-the-art accuracies on the Alzheimer’s-drug benchmarks.
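
To illustrate why l1-regularization can drive the weights of useless clauses to exactly zero, here is a minimal sketch of the soft-thresholding (proximal) operator associated with the l1 penalty. It is a generic illustration and not necessarily the optimizer used in this thesis. The function name and the numbers are made up for the example.

```python
from typing import List, Sequence


def soft_threshold(weights: Sequence[float], threshold: float) -> List[float]:
    """Shrink each weight toward zero by `threshold`; small weights become exactly 0."""
    shrunk = []
    for w in weights:
        if w > threshold:
            shrunk.append(w - threshold)
        elif w < -threshold:
            shrunk.append(w + threshold)
        else:
            shrunk.append(0.0)  # the clause is effectively removed from the model
    return shrunk


if __name__ == "__main__":
    clause_weights = [2.0, -0.25, 0.125, -1.5]
    print(soft_threshold(clause_weights, 0.5))
```

Clauses whose weight magnitude stays below the threshold receive zero weight and can be dropped, which is how the second step prunes the large candidate set produced by the first.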

To further improve the predictive accuracy, we proposed a new
approach to learning weights for MLNs, which aims to maximize the separation
margin instead of the conditional likelihood of the training data. In order to
solve the max-margin optimization problem, we developed a new approximate
algorithm for loss-augmented MPE inference in MLNs based on LP-relaxation.
The max-margin weight learner generally has better, or equally good but more
stable, predictive accuracy than existing discriminative weight learning
methods for MLNs.
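
For reference, one common way to write the separation margin and the loss-augmented inference problem for a log-linear model such as an MLN is sketched below. The notation is an assumption made for this illustration and may differ from the symbols used in chapter 4: w is the weight vector, n(x, y) is the vector of true-grounding counts of the clauses, y is the true labeling, and Delta is a loss function between labelings.

```latex
% Separation margin: score of the true labeling minus the best competing score
\gamma(x, y; \mathbf{w}) = \mathbf{w}^{\top}\mathbf{n}(x, y)
  - \max_{y' \neq y} \mathbf{w}^{\top}\mathbf{n}(x, y')

% Loss-augmented MPE inference: most violating labeling under loss \Delta
\hat{y} = \arg\max_{y'} \left[ \Delta(y, y') + \mathbf{w}^{\top}\mathbf{n}(x, y') \right]
```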


Then, to make the max-margin method more scalable, we derived CDA,
an online algorithm for max-margin structured prediction, from the
primal-dual framework. We applied CDA to learn weights for MLNs on problems
with thousands of examples, which existing batch learning methods for MLNs
cannot handle. Experimental results on several large-scale real-world problems show
that  CDA  generally  achieves  better  accuracy  than  existing  online  methods  for 
structured  prediction.  In  particular,  CDA  is  more  accurate  and  robust  on 
noisy  datasets. 

However, like other existing online algorithms, CDA assumes the input
MLN’s structure is complete and only updates the weights. In practice, the input
structure is usually incomplete, so it should also be updated. To
address this issue, we developed OSL, the first algorithm that performs both
online structure and parameter learning. To find new clauses at each step, we
introduced mode-guided relational pathfinding, which uses mode declarations
to constrain the search of relational pathfinding in a novel way. Experimental
results in field segmentation on two real-world datasets show that OSL is
able to find new useful clauses that improve the predictive accuracies of
well-developed MLNs.

In the final part of the thesis, we addressed the problem of learning with
a mix of hard and soft constraints, which arises in many real-world problems.
Based on first-order logic, MLNs provide a convenient way to encode both soft
and hard constraints. However, the training problem becomes more
computationally expensive due to the complexity introduced by the hard constraints.
To address this issue, we proposed SCT, a simple heuristic for automatically
selecting which hard constraints should be included during training. On the
task of bibliographic citation segmentation, SCT achieves better accuracy than
existing methods for learning with hard constraints, while still allowing efficient
training.

Overall, the work in this thesis has led to progress on discriminative
learning for MLNs. Since many real-world problems are discriminative and
involve noisy structured data with a lot of examples, our work has provided
more accurate and scalable methods for solving those problems.